From e-just at northwestern.edu  Thu Jul  8 14:09:32 2004
From: e-just at northwestern.edu (Eric Just)
Date: Thu Jul  8 14:12:00 2004
Subject: [DAS] Errno
Message-ID: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>


Hi,
I've downloaded biodas (to play with querying my DAS-enabled GBrowse 
server).  There was a fatal error getting this code to run on Windows.

Bio::Das::HTTP::Fetch requires that Errno export 'EINPROGRESS' and 
'EWOULDBLOCK'.
It seems ActiveState Perl on Windows does not export these.  I 
kludged a fix by commenting out the 'use Errno' line and all lines that 
refer to 'EINPROGRESS' or 'EWOULDBLOCK'.  Admittedly a poor solution but I 
don't know too much about sockets and the types of errors that they throw. 
I'd be happy to help fix this on Windows 'properly'.
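
For illustration only, here is a minimal sketch of a more defensive approach than commenting the lines out. It is written in Python rather than Perl, and the names available_codes and is_retryable are hypothetical, not part of Bio::Das. The idea is to look the error constants up by name and to tolerate a platform that does not define one of them, since the numbers behind these names are platform-defined.

```python
import errno

# Names of the "not finished yet" errors a non-blocking socket can report.
# They are looked up by name so that a platform lacking one of them does
# not break the import.
WANTED = ("EINPROGRESS", "EWOULDBLOCK", "EAGAIN")


def available_codes(names=WANTED):
    """Return {name: number} for the errno names this platform defines."""
    return {n: getattr(errno, n) for n in names if hasattr(errno, n)}


def is_retryable(exc, codes=None):
    """True if an OSError from a non-blocking call only means 'try again'."""
    codes = available_codes() if codes is None else codes
    return exc.errno in codes.values()
```

A caller would wrap the non-blocking connect or read in a try block and retry when is_retryable(exc) is true.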

After this fix, the test works (01das.t) and my really basic script works.


Thanks for the efforts, I think this is going to be really great to work 
with our DAS server.

Eric

============================================

Eric Just
e-just@northwestern.edu
dictyBase Programmer
Center for Genetic Medicine
Northwestern University
http://dictybase.org

============================================

From lstein at cshl.edu  Thu Jul  8 14:25:20 2004
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu Jul  8 14:27:32 2004
Subject: [DAS] Errno
In-Reply-To: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
References: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
Message-ID: <200407081425.20106.lstein@cshl.edu>

Hi Eric,

Very bad on my part not to pick up on that.  I'll just hardcode the 
error numbers, which don't change from system to system.

Lincoln

On Thursday 08 July 2004 02:09 pm, Eric Just wrote:
> Hi,
> I've downloaded biodas (to play with querying my DAS-enabled GBrowse
> server).  There was a fatal error getting this code to run on
> Windows.
>
> Bio::Das::HTTP::Fetch requires that Errno export
> 'EINPROGRESS' and 'EWOULDBLOCK'.
> It seems ActiveState Perl on Windows does not export these.
> I kludged a fix by commenting out the 'use Errno' line and all
> lines that refer to 'EINPROGRESS' or 'EWOULDBLOCK'.  Admittedly a
> poor solution but I don't know too much about sockets and the types
> of errors that they throw. I'd be happy to help fix this on Windows
> 'properly'.
>
> After this fix, the test works (01das.t) and my really basic script
> works.
>
>
> Thanks for the efforts, I think this is going to be really great to
> work with our DAS server.
>
> Eric
>
> ============================================
>
> Eric Just
> e-just@northwestern.edu
> dictyBase Programmer
> Center for Genetic Medicine
> Northwestern University
> http://dictybase.org
>
> ============================================
>
> _______________________________________________
> DAS mailing list
> DAS@biodas.org
> http://biodas.org/mailman/listinfo/das

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
From e-just at northwestern.edu  Thu Jul  8 15:30:44 2004
From: e-just at northwestern.edu (Eric Just)
Date: Thu Jul  8 15:33:04 2004
Subject: [DAS] Bio::Das::SegmentI methods
Message-ID: <5.1.1.6.0.20040708143038.03383f70@hecky.it.northwestern.edu>

Hi, it's me again

I found another issue (probably a known issue).  The Bio::Das::SegmentI 
interface defines

overlapping_features
contained_features
contained_in

methods.  These do not seem to work with the Bio::Das object (using with 
GBrowse).  It seems that the rangetype argument gets passed to 
Bio::Das->features method but this method does not do anything with the 
rangetype argument.  I can assist in coding this functionality if it is not 
already planned or there are other issues.

It also seems that the methods themselves have bugs:

each one has lines:

   my @args = $_[0] !~ /^-/ ? (@_,         -rangetype=>'overlaps')
                            : (-types=>\@_,-rangetype=>'overlaps');


I think it should be

   my @args = $_[0] =~ /^-/ ? (@_,         -rangetype=>'overlaps')
                            : (-types=>\@_,-rangetype=>'overlaps');


so that you are passing in the whole hash if you match /^-/
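
The intended dispatch can be sketched as follows. This is a Python illustration of the logic described above, not the Bio::Das code; build_args is a hypothetical name, and the list it returns stands in for Perl's flat named-argument list.

```python
def build_args(args, rangetype="overlaps"):
    """Normalise a call into a flat named-argument list.

    If the first argument starts with '-', the caller already used named
    arguments, so they are passed through unchanged. Otherwise the
    arguments are taken to be a plain list of feature types.
    """
    if args and isinstance(args[0], str) and args[0].startswith("-"):
        return list(args) + ["-rangetype", rangetype]
    return ["-types", list(args), "-rangetype", rangetype]
```

With this reading, build_args(["exon", "intron"]) wraps the two names in a -types list, while build_args(["-types", ["exon"]]) only appends the range type.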

I don't know if this bug should actually go to bioperl, if so I can post it 
on their bugzilla.

Thanks again,
Eric


============================================

Eric Just
e-just@northwestern.edu
dictyBase Programmer
Center for Genetic Medicine
Northwestern University
http://dictybase.org

============================================

From lstein at cshl.edu  Thu Jul  8 18:02:43 2004
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu Jul  8 20:38:27 2004
Subject: [DAS] Errno
In-Reply-To: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
References: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
Message-ID: <200407081802.43219.lstein@cshl.edu>

Hi Eric,

Give this beta version a try.

Lincoln

On Thursday 08 July 2004 02:09 pm, Eric Just wrote:
> Hi,
> I've downloaded biodas (to play with querying my DAS-enabled GBrowse
> server).  There was a fatal error getting this code to run on
> Windows.
>
> Bio::Das::HTTP::Fetch requires that Errno export
> 'EINPROGRESS' and 'EWOULDBLOCK'.
> It seems ActiveState Perl on Windows does not export these.
> I kludged a fix by commenting out the 'use Errno' line and all
> lines that refer to 'EINPROGRESS' or 'EWOULDBLOCK'.  Admittedly a
> poor solution but I don't know too much about sockets and the types
> of errors that they throw. I'd be happy to help fix this on Windows
> 'properly'.
>
> After this fix, the test works (01das.t) and my really basic script
> works.
>
>
> Thanks for the efforts, I think this is going to be really great to
> work with our DAS server.
>
> Eric
>
> ============================================
>
> Eric Just
> e-just@northwestern.edu
> dictyBase Programmer
> Center for Genetic Medicine
> Northwestern University
> http://dictybase.org
>
> ============================================
>
> _______________________________________________
> DAS mailing list
> DAS@biodas.org
> http://biodas.org/mailman/listinfo/das

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Bio-Das-1.00.tar.gz
Type: application/x-tgz
Size: 125064 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/das/attachments/20040708/99b024e3/Bio-Das-1.00.tar-0001.bin
From maximilianh at gmx.de  Fri Jul  9 17:39:06 2004
From: maximilianh at gmx.de (Maximilian Haeussler)
Date: Sun Jul 11 22:10:14 2004
Subject: [DAS] retrieve genes by name
Message-ID: <40EF107A.4010506@gmx.de>

Hi,

I'm a complete newbie to DAS and couldn't find documentation on this issue, so I 
hope you can help me:

1) In June 03 there was a discussion on this list started by Ethan Cerami 
(http://portal.open-bio.org/pipermail/das/2003-January/000647.html) about 
finding a gene by its (HUGO?) name and retrieving the sequence. I didn't 
completely understand it, but from what I've understood, retrieving a CDS was 
not that straightforward. Did it get any easier in the meantime?

2) I am trying to retrieve genes by locuslink/HUGO or any other IDs from biojava 
and get their 5' sequence. Could you point me to some documentation that 
describes this task? Of course, the best would be some "biojava in anger"-style 
cookbook-like recipe on the internet, but any kind of keyword is appreciated. 
Yes, there is the DAS client in biojava, but it does not seem to support gene 
names. Or am I off the track here, is DAS simply not meant to support searches 
like this directly?

Thanks in advance
Max

From project4.bioinformatics at erasmusmc.nl  Mon Jul 12 04:46:00 2004
From: project4.bioinformatics at erasmusmc.nl (Selmar Leeuwenburgh)
Date: Mon Jul 12 04:46:02 2004
Subject: [DAS] Question about empty page and dazzlecfg.xml configuration
Message-ID: <OPEMJHBKHFFAMPBFOGMGIEDHCAAA.project4.bioinformatics@erasmusmc.nl>

Hi,

I have a problem that is probably very easy to solve. At the moment I am
trying to install Dazzle on Tomcat 5.0.25. I am reading "setting up
a Ensembl DAS Server" from http://www.ensembl.org/Docs/das_server_v1.2.pdf.
I am now on page 6, in the last part of step 5. It says: "if you get an
error message or an empty page then check the servlet error log for the
source of the problem. 98 % of the problems are related to errors in the
configuration of the Dazzle webapp (i.e. In the dazzlecfg)"

I get a directory listing when I type "http://localhost:8080/das/" as the
URL in the address bar. Do you know what I need to add or change in
dazzlecfg.xml?

With kind regards,

Selmar.

The current dazzlecfg.xml from my /usr/dazzle/dazzle-webapp-1.01 directory:

<!--
Example configuration file for the Dazzle servlet.

Please check all paths and URIs before deploying this on
your own server.

Information of configuring and deploying Dazzle can
be found at:

http://www.biojava.org/dazzle/

Alternatively, questions can be mailed to:

Thomas Down <td2@sanger.ac.uk>
-->

<dazzle xmlns="http://www.biojava.org/2000/dazzle">

  <!-- Test reference server -->

  <datasource id="test"
      jclass="org.biojava.servlets.dazzle.datasource.EmblDataSource">
    <string name="name" value="Test seqs" />
    <string name="description" value="Test set for promoter-finding software" />
    <string name="version" value="default" />
    <string name="fileName" value="test.embl" />

    <string name="stylesheet" value="test.style" />
  </datasource>

  <!-- Test annotation server. Note that the mapMaster property must
  be changed to match your reference server -->

  <datasource id="tss"
      jclass="org.biojava.servlets.dazzle.datasource.GFFAnnotationSource">
    <string name="name" value="TSS" />
    <string name="description" value="Transcription start sites" />
    <string name="version" value="default" />
    <string name="fileName" value="fickett-tss.gff" />
    <boolean name="dotVersions" value="true" />
    <string name="mapMaster" value="http://localhost:8080/das/test/" />

    <string name="stylesheet" value="tss.style" />
  </datasource>

</dazzle>


directory listing of /usr/tomcat/jakarta-tomcat-5.0.25/webapps

balancer/ das/ das.war jsp-examples/ ROOT/ servlets-examples/
tomcat-docs/ webdav/
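
As a sanity check that is independent of Tomcat, a configuration like the one above can be parsed to confirm that it is well-formed and to list what each data source points at. The following Python sketch does not diagnose the directory-listing problem itself; the CONFIG string is a shortened copy of the file above and the function name datasources is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <dazzle> root element of the config file.
NS = {"dz": "http://www.biojava.org/2000/dazzle"}

CONFIG = """<dazzle xmlns="http://www.biojava.org/2000/dazzle">
  <datasource id="test"
      jclass="org.biojava.servlets.dazzle.datasource.EmblDataSource">
    <string name="fileName" value="test.embl" />
  </datasource>
  <datasource id="tss"
      jclass="org.biojava.servlets.dazzle.datasource.GFFAnnotationSource">
    <string name="fileName" value="fickett-tss.gff" />
    <string name="mapMaster" value="http://localhost:8080/das/test/" />
  </datasource>
</dazzle>"""


def datasources(xml_text):
    """Map each datasource id to a dict of its name/value properties."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    result = {}
    for ds in root.findall("dz:datasource", NS):
        result[ds.get("id")] = {p.get("name"): p.get("value") for p in ds}
    return result
```

Printing datasources(CONFIG) makes it easy to compare the fileName and mapMaster values against the files and URLs that actually exist on the server.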

From ak at ebi.ac.uk  Mon Jul 12 11:19:03 2004
From: ak at ebi.ac.uk (Andreas Kahari)
Date: Mon Jul 12 11:21:12 2004
Subject: [DAS] retrieve genes by name
In-Reply-To: <40EF107A.4010506@gmx.de>
References: <40EF107A.4010506@gmx.de>
Message-ID: <20040712151903.GA10482@ebi.ac.uk>

On Fri, Jul 09, 2004 at 11:39:06PM +0200, Maximilian Haeussler wrote:
> Hi,
> 
> I'm a complete newbie to DAS and couldn't find documentation on this issue, 
> so I hope you can help me:
> 
> 1) In June 03 there was a discussion on this list started by Ethan Cerami 
> (http://portal.open-bio.org/pipermail/das/2003-January/000647.html) about 
> finding a gene by its (HUGO?) name and retrieving the sequence. I didn't 
> completely understand it, but from what I've understood, retrieving a CDS 
> was not that straightforward. Did it get any easier in the meantime?

No, this is not straightforward.  The 'ens1834cds' source at
das.ensembl.org serves CDS coordinates on Ensembl peptides, with
contigs as entry points.

So,
http://das.ensembl.org/das/ens1834cds/features?segment=AC105091
will give you things like

      <FEATURE id="ENSP00000317137-2" label="ENSP00000317137">
        <TYPE id="translation">translation</TYPE>
        <METHOD id="ensembl">ensembl</METHOD>
        <START>55008</START>
        <END>55087</END>

        <SCORE>-</SCORE>
        <ORIENTATION>+</ORIENTATION>
        <PHASE>-</PHASE>
        <GROUP id="translation-ENSP00000317137" type="translation" label="ENSP00000317137">
          <LINK href="http://www.ensembl.org/Homo_sapiens/protview?peptide=ENSP00000317137">ProtView</LINK>
        </GROUP>
      </FEATURE>
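
A client that only needs the coordinates can pull them out of such a FEATURE element with a standard XML parser. The Python sketch below works on a trimmed copy of the element shown above; the function name feature_summary and the choice of fields are illustrative assumptions, not part of any DAS client library.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<FEATURE id="ENSP00000317137-2" label="ENSP00000317137">
  <TYPE id="translation">translation</TYPE>
  <METHOD id="ensembl">ensembl</METHOD>
  <START>55008</START>
  <END>55087</END>
  <SCORE>-</SCORE>
  <ORIENTATION>+</ORIENTATION>
  <PHASE>-</PHASE>
</FEATURE>"""


def feature_summary(xml_text):
    """Extract id, type, coordinates and strand from one FEATURE element."""
    feat = ET.fromstring(xml_text)
    return {
        "id": feat.get("id"),
        "type": feat.findtext("TYPE"),
        "start": int(feat.findtext("START")),
        "end": int(feat.findtext("END")),
        "strand": feat.findtext("ORIENTATION"),
    }
```

The coordinates are positions on the contig used as the segment, so turning them into sequence still requires a separate request to a server that serves the contig's DNA.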


As far as I'm aware, and the Sanger people would be the ones to
know with certainty, we currently have no DAS server serving
CDS *sequence* directly (even though they seem to report
"dna/1.0" in the X-DAS-Capabilities HTTP header).

> 2) I am trying to retrieve genes by locuslink/HUGO or any other IDs from 
> biojava and get their 5' sequence. Could you point me to some documentation 
> that describes this task? Of course, the best would be some "biojava in 
> anger"-style cookbook-like recipe on the internet, but any kind of keyword 
> is appreciated. Yes, there is the DAS client in biojava, but it does not 
> seem to support gene names. Or am I off the track here, is DAS simply not 
> meant to support searches like this directly?

First of all, you need a DAS server that understands the IDs
you're trying to use.  I'm a bit unsure whether DAS is the
right tool here though.  Try something like EnsMart instead
(http://www.ensembl.org/Multi/martview).

For bulk queries, or more complicated stuff, you might want to
look into using the BioMart or Ensembl APIs.  DAS could be, I
think, a bit too simple.  BioMart is discussed on the mart-dev
(http://www.ebi.ac.uk/biomart/contact.html) list, and Ensembl on
the ensembl-dev list (http://www.ensembl.org/Docs/).


Regards,
Andreas

-- 
|[][]| Andreas Kähäri      EMBL, European Bioinformatics Institute
| [] |                     Wellcome Trust Genome Campus
|[][]| Ensembl Developer   Hinxton, Cambridgeshire, CB10 1SD
| [] | DAS Team Leader	   United Kingdom
From maximilianh at gmx.de  Tue Jul 13 07:38:46 2004
From: maximilianh at gmx.de (Maximilian Haeussler)
Date: Tue Jul 13 07:39:16 2004
Subject: [DAS] retrieve genes by name
References: <40EF107A.4010506@gmx.de> <20040712151903.GA10482@ebi.ac.uk>
Message-ID: <40F3C9C6.1050807@gmx.de>

> First of all, you need a DAS server that understands the IDs
> you're trying to use.  I'm a bit unsure whether DAS is the
> right tool here though.  Try something like EnsMart instead
> (http://www.ensembl.org/Multi/martview).
OK, so I won't use DAS, that's nice to know. I couldn't really figure that out 
from the documentation.

> For bulk queries, or more complicated stuff, you might want to
> look into using the BioMart or Ensembl APIs.  DAS could be, I
> think, a bit too simple.  BioMart is discussed on the mart-dev
> (http://www.ebi.ac.uk/biomart/contact.html) list, and Ensembl on
> the ensembl-dev list (http://www.ensembl.org/Docs/).

Hum...I'm not sure, but when I use the ensembl apis, won't I miss a couple of 
model organisms? Arabidopsis, for instance? OK, there is 
http://atensembl.arabidopsis.info/ which might also be useable with the ensembl 
apis. So ensembl seems to be the most comprehensive way to go if I want to 
bulk-download genes of as many organisms as possible...

Max

From ak at ebi.ac.uk  Tue Jul 13 07:55:12 2004
From: ak at ebi.ac.uk (Andreas Kahari)
Date: Tue Jul 13 07:57:09 2004
Subject: [DAS] retrieve genes by name
In-Reply-To: <40F3C9C6.1050807@gmx.de>
References: <40EF107A.4010506@gmx.de> <20040712151903.GA10482@ebi.ac.uk>
	<40F3C9C6.1050807@gmx.de>
Message-ID: <20040713115512.GA18337@ebi.ac.uk>

On Tue, Jul 13, 2004 at 01:38:46PM +0200, Maximilian Haeussler wrote:
> >First of all, you need a DAS server that understands the IDs
> >you're trying to use.  I'm a bit unsure whether DAS is the
> >right tool here though.  Try something like EnsMart instead
> >(http://www.ensembl.org/Multi/martview).
>
> OK, so I won't use DAS, that's nice to know. I couldn't really figure that 
> out from the documentation.

What documentation would this be?

> >For bulk queries, or more complicated stuff, you might want to
> >look into using the BioMart or Ensembl APIs.  DAS could be, I
> >think, a bit too simple.  BioMart is discussed on the mart-dev
> >(http://www.ebi.ac.uk/biomart/contact.html) list, and Ensembl on
> >the ensembl-dev list (http://www.ensembl.org/Docs/).
> 
> Hum...I'm not sure, but when I use the ensembl apis, won't I miss a couple 
> of model organisms? Arabidopsis, for instance? OK, there is 
> http://atensembl.arabidopsis.info/ which might also be useable with the 
> ensembl apis. So ensembl seems to be the most comprehensive way to go if I 
> want to bulk-download genes of as many organisms as possible...

You mentioned HUGO IDs, so I thought you were interested in
human genes only.  You never mentioned what sources of data you
had available, so I picked the solution I had closest at hand.

AFAIK, there is no place which has all data for all species in
one single format.  Ensembl gets close, but we don't do plants.

Regards,
Andreas


-- 
|=)(=| Andreas Kähäri      EMBL, European Bioinformatics Institute
|(==)|                     Wellcome Trust Genome Campus
|=)(=| Ensembl Developer   Hinxton, Cambridgeshire, CB10 1SD
|(==)| DAS Team Leader	    United Kingdom
From ap3 at sanger.ac.uk  Thu Jul 22 08:33:17 2004
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu Jul 22 08:34:55 2004
Subject: [DAS] DAS for protein structures
Message-ID: <40FFB40D.1020508@sanger.ac.uk>

Hi everybody!

I am working together with Thomas Down and Tim Hubbard as part of the 
eFamily project to extend the DAS protocol towards protein structures. 
During this work we realised that two new DAS command extensions are 
required for this:

* structure   - requests 3D coordinates
* alignment - requests a pairwise or multiple alignment of protein 
structures, sequences, or chromosomes.

To read more details please access the specification at
http://www.sanger.ac.uk/xml/das/documentation/

Two example requests:
http://das.sanger.ac.uk/das/aligpdbsp/alignment?query=1a4a
http://das.sanger.ac.uk/das/structure/structure?query=1a4a
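
Both example requests follow the same server/source/command?query pattern, so a client can assemble them from parts. The Python sketch below shows one way to do that; das_url is a hypothetical helper, and only the two URLs above are taken from the announcement.

```python
from urllib.parse import urlencode


def das_url(server, source, command, **params):
    """Build a DAS request URL of the form server/source/command?params."""
    base = f"{server.rstrip('/')}/{source}/{command}"
    return f"{base}?{urlencode(params)}" if params else base


ALIGNMENT = das_url("http://das.sanger.ac.uk/das", "aligpdbsp",
                    "alignment", query="1a4a")
STRUCTURE = das_url("http://das.sanger.ac.uk/das", "structure",
                    "structure", query="1a4a")
```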

The extensions allow new clients to be implemented. A prototype of a 
client for protein structures can be accessed at:
http://www.sanger.ac.uk/Users/ap3/DAS/SPICE/stable/spice.html


Regards,
Andreas

-- 
--------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
 		   

From dalke at dalkescientific.com  Fri Jul 23 07:20:44 2004
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri Jul 23 07:21:19 2004
Subject: [DAS] DAS for protein structures
In-Reply-To: <40FFB40D.1020508@sanger.ac.uk>
References: <40FFB40D.1020508@sanger.ac.uk>
Message-ID: <56915272-DC9A-11D8-B1D5-000A956826C8@dalkescientific.com>

Hi Andreas,

Some comments on the proposal

> To read more details please access the specification at
> http://www.sanger.ac.uk/xml/das/documentation/

 > The SEQRES protein sequences, which is contained in a  PDB file, can be
 > different to some extent.

Might want to link to the PDB docs for the SEQRES records.  You
can also find heterogens (non-ATCG) in the sequence, and it
mentions one of my favorite words in the docs - microheterogeneity.


 >  There can be  negative positions, the order of the numbers
 > does not need to be  linear, there are alternative locations
 > possible (indicated by "A",  "B"),

In this case I suspect the "A" and "B" are insertion codes and
not alternate locations.  The latter is used when it appears
an atom can be in one of multiple positions, as I recall.


 > All orientation arguments that are used in various services
 > are becoming optional, since orientation is related to the
 > orientation along the DNA and is not needed for proteins.

Isn't it still required for nucleotides and ignored for protein?
Otherwise as you state it the orientation parameter is also
optional for DNA.  Is "orientation=+" or "orientation=" equivalent
to an unspecified orientation parameter when the sequence is
a protein?

 > depreciated

"deprecated"

 > "is re-established again."

"is re-established."  Unless this is the second time it's been
re-established?

 > "The ref is argument has "

"The ref argument has "

 > It has a version  number (required) in the form "N.NN"

Define "N.NN".  Does this mean there can be only 1000 versions?
Why the limit?  Why not \d+\.\d+ or \d+(\.\d+)?  ?  Should there
be a meaning to the two parts of the version?  Should it always
be an increasing value?  Isn't the version information captured
elsewhere?
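
To make the difference between these readings concrete, here is a small Python sketch that compares a literal reading of "N.NN" with the two looser patterns suggested above. The constant names and the accepts helper are illustrative only; the spec under discussion does not define any of them.

```python
import re

STRICT = re.compile(r"\d\.\d\d")            # literal reading of "N.NN"
LOOSE = re.compile(r"\d+\.\d+")             # any major.minor pair
OPTIONAL_MINOR = re.compile(r"\d+(\.\d+)?") # minor part may be omitted


def accepts(pattern, version):
    """True if the whole version string matches the pattern."""
    return pattern.fullmatch(version) is not None


# Every string the strict pattern can accept: 10 majors x 100 minors.
STRICT_VERSIONS = [f"{a}.{b:02d}" for a in range(10) for b in range(100)]
```

Under the strict reading, "10.00" is already rejected, which is where the limit of 1000 versions comes from.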

 > Whenever the DNA of the entry point changes, the version
 > number should change as well.

"Should"?  Or "must"?


The entry_points optional attribute "href"
 > echoes the URL query that was used to fetch  the current document.

I don't understand the need for this.  If it's important, it won't
work in some environments because the client's request might be
   http://some.host/x/y/z

where the machine "some.host" forwards the request to another machine as
   http://another.host/prefix/x/y/z

which does the actual work.  The machine "another.host" is on
its own local DNS which isn't visible to the outside world.  Since
the internal machine doesn't know the original URL used by the
client it can't pass back a valid URL.

 >  For compatibility with older versions of the specification, the
 > <SEGMENT>  tag can use a size attribute rather than start and stop,
 > and  can omit the orientation attribute

Can "size" be used in addition to start/stop as a transition from
the older version to the newer one?  If omitted, is the orientation
equal to "+"?

 > This query returns one or all alginments

"alignments"


Under the <dasalignment> XML you have
 > (required; one only)  >The doctype indicates which formal DTD
 > specification to use.  For the dna query, the doctype DTD is
 > "http://www.biodas.org/dtd/dasdna.dtd".

Is that a bad copy&paste from the previous spec?


 > subject (optional; one or more) the id of the alignment - subject.
 > To get a list of available alignments for query use the entry_points
 > request.

If there is more than one subject, how is the parameter constructed?
Is it comma separated?

 >  (required) version of Object. e.g. CRC64 checksum for protein
 > sequences.

Why is this version not in the form N.NN?

Why is CRC64 suggested?  (md5 is better.)  Why only for protein
sequences?

 > attribute:intObjectId
 >
 > (required) internal, unique name name for this object.  This is used in
 > the SEGMENT section to identify to which object an alignment belongs to.

The prefix "int" is confusing.  Even "internal" is confusing -- internal
to what?  What about "sequenceId" since all the objects are sequences?

attribute:type

 > (optional) a type for this object.e.g. DNA, PROTEIN, STRUCTURE, etc.

Who defines "etc."?  What about "RNA"?  "ssRNA"?  "tRNA"?  Is the
case important?

The example you give includes

<alignment>
   <alignObject dbAccessionId="someid" objectVersion="version"
                intObjectId="internalId" type="objectType"
                dbSource="someSouce"
                dbVersion="version" dbCoordSys="coords"  >
     <alignObjectDetail dbSource="someSouce" property="property">


Please move "dbAccessionId" to be with the attributeGroup:dbRef
terms, to make it easier to compare the outline with the documentation.

Could you give snippets from a real example?

 >        <score methoName="scorename" value="scorevalue">

"methodName"

 > attribute:dbCoordSys
 > (optional). The co-ordinate system used by the database. This
 > is not always the same as the database. For example, Pfam uses
 > UniProt ...

How is this specified?


 > Clients generally should use the DAS - SEQUENCE request to get the
 > seqeuence, so this is optional

If it's optional then why have it here?  As defined, all clients must
understand how to get to the DAS - SEQUENCE since they cannot assume the
server supports returning the sequence here.  And btw, it's "sequence"
not "seqeuence".


 > attribute:property

What are the defined property values for an alignObjectDetail?  Also,
fix up the formatting for this example.  Also, "CDATA" refers to unescaped
character content while I think you mean "element content".

 > attribute:methodName
 > (required) the name of the score, e.g. number of equivlanet
 > residues (eqr), e-value, etc.

what about "scoreType"?  Do you have an enumerated list?  Are all of the
values expected to be a number?  If so, is there a restriction to the
range of the number?  Are IEEE754 exceptional values, like NaN or Inf
allowed?

 > Element:<geo3D>

You use the "cigar" string because it provides an "efficient way to
encode an alignment" but then you don't provide an efficient way to
encode the rotation matrix.  Two possibilities are:
   - it's orthonormal so only include the upper/lower triangle
   - use comma separated values

You don't say if the vector transformation occurs before or after the
rotation matrix.  Nor do you say which structure gets the transformation,
since it only states:
     this  section defines how one of the needs to be shifted and rotated
     in order to be superimposed with the others.

Couldn't you just write this as a (perhaps flattened) homogeneous
transformation matrix simplified because you know it's only going
to be used for rigid body transformations?

The result would look like:
   <geo3D intObjectId="xxx">r11,r12,r13,r22,r23,r33,t1,t2,t3</geo3D>
and be much more succinct than what you have now.
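
The question of ordering is not cosmetic: rotating and then shifting generally gives a different result from shifting and then rotating. The Python sketch below demonstrates this with made-up values (a 90-degree rotation about z and a unit shift along x); the function names are hypothetical and nothing here is taken from the spec.

```python
def rotate(r, p):
    """Multiply a 3x3 matrix r (nested lists, row-major) by point p."""
    return tuple(sum(r[i][j] * p[j] for j in range(3)) for i in range(3))


def translate(t, p):
    """Add shift vector t to point p."""
    return tuple(p[i] + t[i] for i in range(3))


# Illustrative values only: 90 degrees about the z axis, shift along x.
ROT_Z90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
SHIFT = (1, 0, 0)


def rotate_then_shift(p):
    return translate(SHIFT, rotate(ROT_Z90, p))


def shift_then_rotate(p):
    return rotate(ROT_Z90, translate(SHIFT, p))
```

For the point (1, 0, 0) the first order gives (1, 1, 0) and the second gives (0, 2, 0), so a client cannot superimpose structures correctly unless the convention is stated.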


Under "Retrieve 3D coordinates".

If the chain is not given is it assumed to be equivalent to
the chain " "?  All PDB residues have a chain, and space is allowed
for a chain id.  Or does unspecified chain mean get the first chain?

Since "one or more" chain ids are allowed, how are they given?  Comma
separated values?

Where do I find the number of models in the structure?  According to
the docs it implies it can be found from entry_points ("The same
applies to a  structure server where entry_points returns the list of
available chains and models.")  I don't see that field described.

How do you support the alternate location identifier?  Just ignore it?
Return all locations for a given atom?

Why do you define your own XML format for 3D structure?  What about
basing it on, say, CML?  Or why not just feed a PDB file back, perhaps
embedded inside of XML?  After all, no structure program is going to
handle your XML format.

If you do want to roll your own, there are many things to fix.  Here
are several:

 > attribute:groupID
 >
 > (required) the PDB code of the amino acid. e.g. 25,26,27A
 >
 > attribute:insertCode
 >
 > (optional) insertion code for amino acid. e.g 86A, 86B

Okay, which is the group ID and which is the insertion code?  First
should be a number (-2, 0, 26) and the insertion code is a
character.
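
The split described above can be sketched in Python as follows, assuming the combined form is a signed integer optionally followed by a single letter (as in "27A" or "-2"). The function name split_residue_id is hypothetical.

```python
import re

# A signed residue number followed by an optional one-letter insertion code.
_RESIDUE = re.compile(r"(-?\d+)([A-Za-z]?)")


def split_residue_id(text):
    """Split e.g. '27A' into (27, 'A'); no insertion code gives ''."""
    match = _RESIDUE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a residue identifier: {text!r}")
    return int(match.group(1)), match.group(2)
```

Keeping the two parts in separate attributes, as the spec's attribute names suggest, would avoid the need for this parsing on the client side.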

 >         <connect type="connectionType">
 >                <atomid atomID="atomID"/>
 >        </connect>

Two atoms make a connection.  Where's the other atomID?  Also, in
some places you have "Id" (as "dbAccessionId") and in others
you have "ID".

Are only covalent bonds important?  What about HYDBND records?

You also ignore the anisotropic B-factors and other bits of data
which may be in the PDB file.  For example, waters on the symmetry
axis of a crystal structure may be denoted by an occupancy value
of 1/symmetry count.  (See the comments for 2PLV.)

And you're missing the crystal information.

It's 5am here so my apologies if any of the above sounds overly
terse or confusing.

					Andrew
					dalke@dalkescientific.com

From ap3 at sanger.ac.uk  Sun Jul 25 13:49:47 2004
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun Jul 25 13:58:34 2004
Subject: [DAS] DAS for protein structures
Message-ID: <200407251849.47774.ap3@sanger.ac.uk>

 Hi Andrew !

Thanks for your detailed feedback. Let me go through the most important issues 
of your mail:

> Why do you define your own XML format for 3D structure?  What about
> basing it on, say, CML?  Or why not just feed a PDB file back, perhaps
> embedded inside of XML?  

*  DAS responses consist of XML files that provide a simple format to exchange 
data. PDB files contain different types of data: biological data about the 
protein, literature refs, description of the experiment and finally the 
coordinates. So I would not want to mix DAS-XML and (traditional) PDB files. 
As you mentioned there are several XML formats for the replacement of PDB 
files. It does not make sense to invent yet another one to deal with *all* 
the PDB data.  Here the idea is to reduce the PDB file to the minimal data 
needed for visualization, i.e. coordinates of atoms and their connections. 
The biological data that is projected onto the 3D structure by a client is 
retrieved via DAS - Feature and Alignment services.


> After all, no structure program is going to
> handle your XML format. 

I guess no structure program is capable of doing ANY - DAS communication at 
the moment.  That's what we try to provide - missing services to apply DAS in 
the structure world. If you are developing a Java program  (I know you are a 
Python guy, but still ;-)  , making it DAS enabled  is quite simple. There is 
support for the new  DAS commands in Biojava. e.g.:

To get a Biojava structure object via DAS 
  
String server = "http://das.sanger.ac.uk/das/structure/structure?query=";
DASStructureClient dasc = new DASStructureClient(server);
Structure struc = dasc.getStructure(pdbcode);	    	

> You use the "cigar" string because it provides an "efficient way to
> encode an alignment" but then you don't provide an efficient way to
> encode the rotation matrix.  

Yes, but the matrix does not take much space, so it is not really an issue. An 
alignment in contrast can be quite big, so the cigar encoding saves a lot of 
space.

> Why is CRC64 suggested?  (md5 is better.) 

This is the checksum provided by Swissprot. 

> The entry_points optional attribute "href" 
>> echoes the URL query that was used to fetch  the current document.

>I don't understand the need for this.

Same here. It is in the DAS spec, so I kept it. There are a couple of issues 
with entry_points and proteins anyway. E.g. Swissprot has >150.000 "entry 
points" ;-)

I will address several of your other issues by improving the documentation 
over the next days.

Regards,
Andreas

-- 

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK

From dalke at dalkescientific.com  Sun Jul 25 16:07:10 2004
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sun Jul 25 16:07:33 2004
Subject: [DAS] DAS for protein structures
In-Reply-To: <200407251849.47774.ap3@sanger.ac.uk>
References: <200407251849.47774.ap3@sanger.ac.uk>
Message-ID: <363A4690-DE76-11D8-B1D5-000A956826C8@dalkescientific.com>

Andreas:
> Here the idea is to reduce the PDB file to the minimal data
> needed for visualization, i.e. coordinates of atoms and their connections.
> The biological data that is projected onto the 3D structure by a client is
> retrieved via DAS - Feature and Alignment services.

What is "the minimal data needed for visualization"?  The most terse
file format I know is the XYZ format, which has X, Y, Z coordinates
and element type.  Everything else about the structure can be
derived from that either through quantum mechanics or through
empirical methods.

Humans want more than that, like residue name, chain id, and segment
name (I don't think your spec had the last).  Some people want
to see how the structure fits in the crystal, e.g. to see if a
given feature is more an aspect of crystal packing forces.  Some want
the secondary structure annotation information (HELIX and SHEET)
while others are just fine with automated means.

By saying you're only going to support a subset of what's in the
PDB you're saying that those other portions aren't important enough.
But they are, or could be for some people and some structures.


>> After all, no structure program is going to
>> handle your XML format.
>
> I guess no structure program is capable of doing ANY - DAS communication at
> the moment.  That's what we try to provide - missing services to apply DAS in
> the structure world. If you are developing a Java program  (I know you are a
> Python guy, but still ;-)  , making it DAS enabled  is quite simple. There is
> support for the new  DAS commands in Biojava. e.g.:

But it's a lot easier to get an existing Java structure visualization
library to support a PDB file than to support your new format, or
your biojava structure object.  For example, suppose I want to use
Jmol or Marvin as my viewer -- how hard would that be using your API?

I see the Biojava structure object supports reading the PDB format
but it doesn't capture all of the data so going through it to
read the DAS result then generate a PDB formatted string to pass
to another library will cause some data loss.

There are many sources of data loss.  For example, I see you
support the x-ray resolution field, but it turns out that the
documentation isn't correct.  It isn't a simple float because
a resolution of "1.20" is different than one of "1.2".  There are
a few other places like that.  And you don't support PDB version
1 files, nor extensions like XPLOR's serial numbering, where the
first digit can roll over to A (as in 99999, A0000, ...) to support
more than 99999 atoms.
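
A rough Python sketch of both points.  The rollover rule in
decode_serial is an assumption made for illustration (a leading 'A'
read as 10, 'B' as 11, and so on, so A0000 follows 99999); check it
against real XPLOR output before relying on it.  The function name is
hypothetical too.

```python
from decimal import Decimal


def decode_serial(field):
    """Decode a 5-character atom serial field.

    Assumption: a leading letter extends the first digit past 9,
    so 'A0000' is 100000 and 'B0000' is 110000.
    """
    field = field.strip()
    if field[0].isdigit():
        return int(field)
    lead = 10 + ord(field[0].upper()) - ord("A")
    return lead * 10000 + int(field[1:])


print(decode_serial("99999"), decode_serial("A0000"))

# A float forgets the trailing zero; a Decimal (or the raw string) keeps it.
print(float("1.20") == float("1.2"))
print(Decimal("1.20"), Decimal("1.2"))
```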

> To get a Biojava structure object via DAS
>
> String server = "http://das.sanger.ac.uk/das/structure/structure?query=";
> DASStructureClient dasc = new DASStructureClient(server);
> Structure struc = dasc.getStructure(pdbcode);

Suppose you instead returned

<structure type="chemical/x-pdb">
HEADER    IMMUNOGLOBULIN                          16-JAN-92   XXXX
TITLE     2.9 ANGSTROMS RESOLUTION STRUCTURE OF AN ANTI-DINITROPHENYL-
TITLE    2 SPIN-LABEL MONOCLONAL ANTIBODY FAB FRAGMENT WITH BOUND
TITLE    3 HAPTEN
  ...
ATOM
  ...
END
</structure>

The API wouldn't change at all.  The implementation would, but
not the API.

Or suppose you instead used a more ReST-ful format which returns

<structure href="some/other/url" />

Then that href lookup could be cached, or translated into a local
fetch, or pointed to RCSB's PDB server.  It could also support
things like content negotiation to return a PDB vs. CML vs. other
file format, at the desire of the client.  (Though con-neg is
still more a hope of mine than something actually used.)

In any case, the API would be identical to what you propose.
The format is just that, a format.  There must be something
to convert it to a Biojava API whether that format be this
new XML one, PDB or mmCIF.  Your API hides the conversion layer,
so it's invisible to the application code no matter the format.
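
As a sketch of that conversion layer, here is some Python (not the
Biojava API) where the caller sees one function whether the response
carries the data inline or points at it with an href.  Everything
named here is hypothetical: get_structure, the PARSERS registry, the
stand-in parse_pdb, and the caller-supplied fetch function that is
assumed to return a (mime type, text) pair.

```python
import xml.etree.ElementTree as ET


def parse_pdb(text):
    """Stand-in PDB reader: just collects the ATOM/HETATM record lines."""
    return [ln for ln in text.splitlines()
            if ln.startswith(("ATOM", "HETATM"))]


# Hypothetical registry mapping a MIME type to a parser.
PARSERS = {"chemical/x-pdb": parse_pdb}


def get_structure(response_xml, fetch=None):
    """Return a parsed structure from a <structure> response element."""
    elem = ET.fromstring(response_xml)
    href = elem.get("href")
    if href is not None:
        # Indirect form: the fetcher may cache, go local, or go remote.
        mime_type, text = fetch(href)
    else:
        # Inline form: the payload is the element's text.
        mime_type, text = elem.get("type"), elem.text or ""
    return PARSERS[mime_type](text)


INLINE = """<structure type="chemical/x-pdb">
HEADER    EXAMPLE
ATOM      1  N   ALA A   1      11.104   6.134  -6.504
ATOM      2  CA  ALA A   1      11.639   6.071  -5.147
END
</structure>"""

print(len(get_structure(INLINE)))
```

Supporting CML or mmCIF would mean adding an entry to the registry;
the application code calling get_structure would not change.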

>> You use the "cigar" string because it provides an "efficient way to
>> encode an alignment" but then you don't provide an efficient way to
>> encode the rotation matrix.
>
> Yes, but the matrix does not take much space, so it is not really an
> issue. An alignment in contrast can be quite big, so the cigar
> encoding saves a lot of space.

Then don't even worry about it as a space issue.  Just give the
4x4 homogeneous transformation matrix.  Anyone doing structure work
should have libraries for handling coordinate transforms like this,
and it's much more elegant than having several different element
types (for both the matrix and vector).

I'll still argue that you should use a format like
<geo3d>m11,m12,m13,m14,m21,m22,m23,m24,m31,m32,m33,m34,m41,m42,m43,m44</geo3d>

rather than

         <geo3D intObjectId="intObjectId">
                 <vector x="xCoord" y="yCoord" z="zCoord"/>
                 <matrix>
                         <max11 coord="float"/>
                         <max12 coord="float"/>
                         <max13 coord="float"/>
                         <max21 coord="float"/>
                         <max22 coord="float"/>
                         <max23 coord="float"/>
                         <max31 coord="float"/>
                         <max32 coord="float"/>
                         <max33 coord="float"/>
                 </matrix>
         </geo3D>

It's just so much easier for implementers to read a single
vector of numbers into a 4x4 matrix than to read your format.
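
A minimal Python sketch of how little code the flat form needs.  It
assumes the sixteen values are in row-major order with the translation
in the fourth column (m14, m24, m34); the function names are
illustrative.

```python
def parse_matrix4(text):
    """Turn 16 comma-separated numbers into a 4x4 list of rows."""
    values = [float(v) for v in text.split(",")]
    if len(values) != 16:
        raise ValueError("expected 16 values, got %d" % len(values))
    return [values[i:i + 4] for i in range(0, 16, 4)]


def transform(matrix, point):
    """Apply a homogeneous 4x4 transform to an (x, y, z) point."""
    x, y, z = point
    vec = (x, y, z, 1.0)
    out = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    return tuple(out[:3])


# Identity rotation plus a translation of (1, 2, 3).
m = parse_matrix4("1,0,0,1,0,1,0,2,0,0,1,3,0,0,0,1")
print(transform(m, (1.0, 1.0, 1.0)))
```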

What is your criterion for weighing space savings against
implementation cost?  Why wouldn't

<geo3D intObjectId="intObjectId" x="xCoord" y="yCoord" z="zCoord"
   r11="float" r12="float" r13="float" ... r33="float" />

be even more concise and readable?

Another option is to consider how the SVG spec handles the
same problem, though it is in 2D instead of 3D.  Here are
a few examples I found:

<g transform="translate(-10,-20) scale(2) rotate(45) translate(5,10)">

<g transform="translate(-10,-20)">
   <g transform="scale(2)">
     <g transform="rotate(45)">
       <g transform="translate(5,10)">

<g transform="matrix(1 0 0 1 10 -3)">

The last is the closest to what I'm proposing.  (The earlier
ones are harder because the rotation can be around different axes.)

That suggests an even nicer encoding as

<geo3D intObjectId="intObjectId"
     matrix="r11 r12 r13 r21 r22 r23 r31 r32 r33 t1 t2 t3" />

(or use the full 4x4 matrix).  Terse, concise, easy to support.
What's not to like about it?
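
For what it's worth, a sketch of a reader for that attribute form in
Python, using only the standard library.  It assumes the order given
above (nine rotation terms row by row, then t1 t2 t3) and packs them
into a 4x4 homogeneous matrix; read_geo3d and the "1abc" id are made
up for the example.

```python
import xml.etree.ElementTree as ET


def read_geo3d(xml_text):
    """Return (intObjectId, 4x4 matrix) from a <geo3D matrix="..."/>."""
    elem = ET.fromstring(xml_text)
    v = [float(s) for s in elem.get("matrix").split()]
    if len(v) != 12:
        raise ValueError("expected 12 values, got %d" % len(v))
    rotation = [v[0:3], v[3:6], v[6:9]]
    translation = v[9:12]
    # Append each translation term to its rotation row, then the
    # constant bottom row of a homogeneous transform.
    matrix4 = [row + [t] for row, t in zip(rotation, translation)]
    matrix4.append([0.0, 0.0, 0.0, 1.0])
    return elem.get("intObjectId"), matrix4


object_id, matrix = read_geo3d(
    '<geo3D intObjectId="1abc" matrix="1 0 0 0 1 0 0 0 1 5 6 7" />'
)
print(object_id)
for row in matrix:
    print(row)
```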


>> Why is CRC64 suggested?  (md5 is better.)
>
> This is the checksum provided by Swissprot.

But why is it suggested?  Why not just leave it as

 > attribute:objectVersion
 >
 >  (required) version of Object

and don't make any recommendation for how to construct the
checksum.  Better would be to make some functional description
of the version, like "must change when the sequence changes"
for the weak version you have, or "must be a positive integer
which increments when the sequence changes" for a strict version.
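
To illustrate that a functional description leaves the choice open,
here are two hypothetical Python implementations.  The first satisfies
the weak wording with an md5 digest (it changes when the sequence
changes, barring collisions); the second satisfies the strict wording
with a counter.  Neither is proposed as the spec's method, and the
names are invented for the example.

```python
import hashlib


def checksum_version(sequence):
    """Weak form: a value that changes when the sequence changes."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


class CounterVersion:
    """Strict form: a positive integer that increments on each change."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.version = 1

    def update(self, sequence):
        if sequence != self.sequence:
            self.sequence = sequence
            self.version += 1
        return self.version


record = CounterVersion("MKVLA")
print(record.update("MKVLA"), record.update("MKVLG"))
print(checksum_version("MKVLA") != checksum_version("MKVLG"))
```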

BTW, as written the objectVersion can be identical to the
protein sequence itself.  Is there a limit to the size of
the version string?

The SWISS-PROT record also keeps the timestamp for the
last change of the protein sequence.  What about using that
field instead?  Not that I want to mandate that one, but I
offer it as another value which meets your spec, and seems
more appropriate.

Do you know about the Chemistry Development Kit
(http://sourceforge.net/projects/cdk/ ) or Joelib
(http://www-ra.informatik.uni-tuebingen.de/software/joelib/index.html )?
They are two other open-source chemistry libraries for Java and may
contain code or techniques you all can draw from.

					Andrew
					dalke@dalkescientific.com

From ap3 at sanger.ac.uk  Mon Jul 26 08:06:37 2004
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon Jul 26 08:08:09 2004
Subject: [DAS] DAS for protein structures
In-Reply-To: <363A4690-DE76-11D8-B1D5-000A956826C8@dalkescientific.com>
References: <200407251849.47774.ap3@sanger.ac.uk>
	<363A4690-DE76-11D8-B1D5-000A956826C8@dalkescientific.com>
Message-ID: <4104F3CD.3090802@sanger.ac.uk>

Andrew:

In DAS there are Annotation servers and Reference servers. The structure 
service is just another type of Reference server. It only needs to serve 
coordinates. All other data should be provided using other DAS services. 
This way, existing clients that cannot do 3D can use the same 
services and represent the data in 2D.

I do not think that continuing this discussion by email is going to 
lead anywhere, but I understand you are going to ISMB, so if you or 
anybody else on this list is interested, we can have a meeting there.

Regards,

Andreas


-- 
--------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK