From e-just at northwestern.edu Thu Jul 8 14:09:32 2004
From: e-just at northwestern.edu (Eric Just)
Date: Thu Jul 8 14:12:00 2004
Subject: [DAS] Errno
Message-ID: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>

Hi,

I've downloaded biodas (to play with querying my DAS-enabled GBrowse
server). There was a fatal error getting this code to run on
Windows.

Bio::Das::HTTP::Fetch requires that Errno export 'EINPROGRESS' and
'EWOULDBLOCK'. It seems the ActiveState version of Perl on Windows
does not export these. I kludged a fix by commenting out the
'use Errno' line and all lines that refer to 'EINPROGRESS' or
'EWOULDBLOCK'. Admittedly a poor solution, but I don't know too much
about sockets and the types of errors that they throw. I'd be happy to
help fix this on Windows 'properly'.

After this fix, the test works (01das.t) and my really basic script
works.

Thanks for the efforts, I think this is going to be really great to
work with our DAS server.

Eric

============================================

Eric Just
e-just@northwestern.edu
dictyBase Programmer
Center for Genetic Medicine
Northwestern University
http://dictybase.org

============================================

From lstein at cshl.edu Thu Jul 8 14:25:20 2004
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu Jul 8 14:27:32 2004
Subject: [DAS] Errno
In-Reply-To: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
References: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
Message-ID: <200407081425.20106.lstein@cshl.edu>

Hi Eric,

Very bad on my part not to pick up on that. I'll just hardcode the
error numbers, which don't change from system to system.

Lincoln

On Thursday 08 July 2004 02:09 pm, Eric Just wrote:
> Hi,
> I've downloaded biodas (to play with querying my DAS-enabled GBrowse
> server). There was a fatal error getting this code to run on
> Windows.
>
> Bio::Das::HTTP::Fetch requires that Errno export
> 'EINPROGRESS' and 'EWOULDBLOCK'.
> It seems the ActiveState version of Perl on Windows does not export these.
> I kludged a fix by commenting out the 'use Errno' line and all
> lines that refer to 'EINPROGRESS' or 'EWOULDBLOCK'. Admittedly a
> poor solution, but I don't know too much about sockets and the types
> of errors that they throw. I'd be happy to help fix this on Windows
> 'properly'.
>
> After this fix, the test works (01das.t) and my really basic script
> works.
>
>
> Thanks for the efforts, I think this is going to be really great to
> work with our DAS server.
>
> Eric
>
> ============================================
>
> Eric Just
> e-just@northwestern.edu
> dictyBase Programmer
> Center for Genetic Medicine
> Northwestern University
> http://dictybase.org
>
> ============================================
>
> _______________________________________________
> DAS mailing list
> DAS@biodas.org
> http://biodas.org/mailman/listinfo/das

--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724

From e-just at northwestern.edu Thu Jul 8 15:30:44 2004
From: e-just at northwestern.edu (Eric Just)
Date: Thu Jul 8 15:33:04 2004
Subject: [DAS] Bio::Das::SegmentI methods
Message-ID: <5.1.1.6.0.20040708143038.03383f70@hecky.it.northwestern.edu>

Hi, it's me again.

I found another issue (probably a known issue). The
Bio::Das::SegmentI interface defines the

  overlapping_features
  contained_features
  contained_in

methods. These do not seem to work with the Bio::Das object (used
with GBrowse). It seems that the rangetype argument gets passed to
the Bio::Das->features method, but this method does not do anything
with the rangetype argument. I can assist in coding this functionality
if it is not already planned or there are other issues.

It also seems that the methods themselves have bugs. Each one has
these lines:

  my @args = $_[0] !~ /^-/ ? (@_, -rangetype=>'overlaps') : (-types=>\@_,-rangetype=>'overlaps');

I think it should be

  my @args = $_[0] =~ /^-/ ?
      (@_, -rangetype=>'overlaps') : (-types=>\@_,-rangetype=>'overlaps');

so that you are passing in the whole hash if you match /^-/.

I don't know if this bug should actually go to BioPerl; if so, I can
post it on their Bugzilla.

Thanks again,

Eric

============================================

Eric Just
e-just@northwestern.edu
dictyBase Programmer
Center for Genetic Medicine
Northwestern University
http://dictybase.org

============================================

From lstein at cshl.edu Thu Jul 8 18:02:43 2004
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu Jul 8 20:38:27 2004
Subject: [DAS] Errno
In-Reply-To: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
References: <5.1.1.6.0.20040708125910.02a74aa0@hecky.it.northwestern.edu>
Message-ID: <200407081802.43219.lstein@cshl.edu>

Hi Eric,

Give this beta version a try.

Lincoln

On Thursday 08 July 2004 02:09 pm, Eric Just wrote:
> Hi,
> I've downloaded biodas (to play with querying my DAS-enabled GBrowse
> server). There was a fatal error getting this code to run on
> Windows.
>
> Bio::Das::HTTP::Fetch requires that Errno export
> 'EINPROGRESS' and 'EWOULDBLOCK'.
> It seems the ActiveState version of Perl on Windows does not export these.
> I kludged a fix by commenting out the 'use Errno' line and all
> lines that refer to 'EINPROGRESS' or 'EWOULDBLOCK'. Admittedly a
> poor solution, but I don't know too much about sockets and the types
> of errors that they throw. I'd be happy to help fix this on Windows
> 'properly'.
>
> After this fix, the test works (01das.t) and my really basic script
> works.
>
>
> Thanks for the efforts, I think this is going to be really great to
> work with our DAS server.
>
> Eric
>
> ============================================
>
> Eric Just
> e-just@northwestern.edu
> dictyBase Programmer
> Center for Genetic Medicine
> Northwestern University
> http://dictybase.org
>
> ============================================
>
> _______________________________________________
> DAS mailing list
> DAS@biodas.org
> http://biodas.org/mailman/listinfo/das

--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Bio-Das-1.00.tar.gz
Type: application/x-tgz
Size: 125064 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/das/attachments/20040708/99b024e3/Bio-Das-1.00.tar-0001.bin

From maximilianh at gmx.de Fri Jul 9 17:39:06 2004
From: maximilianh at gmx.de (Maximilian Haeussler)
Date: Sun Jul 11 22:10:14 2004
Subject: [DAS] retrieve genes by name
Message-ID: <40EF107A.4010506@gmx.de>

Hi,

I'm a complete newbie to DAS and couldn't find documentation on this
issue, so I hope you can help me:

1) In June 03 there was a discussion on this list started by Ethan
Cerami (http://portal.open-bio.org/pipermail/das/2003-January/000647.html)
about finding a gene by its (HUGO?) name and retrieving the sequence.
I didn't completely understand it, but from what I've understood,
retrieving a CDS was not that straightforward. Did it get any easier
in the meantime?

2) I am trying to retrieve genes by LocusLink/HUGO or any other IDs
from BioJava and get their 5' sequence. Could you point me to some
documentation that describes this task? Of course, the best would be
some "BioJava in Anger"-style cookbook-like recipe on the internet,
but any kind of keyword is appreciated. Yes, there is the DAS client
in BioJava, but it does not seem to support gene names. Or am I off
track here, and DAS is simply not meant to support searches like this
directly?

Thanks in advance
Max

From project4.bioinformatics at erasmusmc.nl Mon Jul 12 04:46:00 2004
From: project4.bioinformatics at erasmusmc.nl (Selmar Leeuwenburgh)
Date: Mon Jul 12 04:46:02 2004
Subject: [DAS] Question about empty page and dazzlecfg.xml configuration
Message-ID:

Hi,

I have a problem that is probably very easy to solve. I am currently
trying to install Dazzle on Tomcat 5.0.25. I am reading "setting up a
Ensembl DAS Server" from www.ensembl.org/Docs/das_server_v1.2.pdf.
I am now on page 6, in the last part of step 5. There I read: "if you
get an error message or an empty page then check the servlet error log
for the source of the problem. 98 % of the problems are related to
errors in the configuration of the Dazzle webapp (i.e. In the
dazzlecfg)"

I get a directory listing when I type "http://localhost:8080/das/" as
the URL in the address bar. Do you know what I need to add or change
in the dazzlecfg.xml?

With kind regards,
Selmar.

The current dazzlecfg.xml from my /usr/dazzle/dazzle-webapp-1.01
directory:

directory listing of /usr/tomcat/jakarta-tomcat-5.0.25/webapps

balancer/
das/
das.war
jsp-examples/
ROOT/
servlets-examples/
tomcat-docs/
webdav/

From ak at ebi.ac.uk Mon Jul 12 11:19:03 2004
From: ak at ebi.ac.uk (Andreas Kahari)
Date: Mon Jul 12 11:21:12 2004
Subject: [DAS] retrieve genes by name
In-Reply-To: <40EF107A.4010506@gmx.de>
References: <40EF107A.4010506@gmx.de>
Message-ID: <20040712151903.GA10482@ebi.ac.uk>

On Fri, Jul 09, 2004 at 11:39:06PM +0200, Maximilian Haeussler wrote:
> Hi,
>
> I'm a complete newbie to DAS and couldn't find documentation on this issue,
> so I hope you can help me:
>
> 1) In June 03 there was a discussion on this list started by Ethan Cerami
> (http://portal.open-bio.org/pipermail/das/2003-January/000647.html) about
> finding a gene by its (HUGO?) name and retrieving the sequence. I didn't
> completely understand it, but from what I've understood, retrieving a CDS
> was not that straightforward.
> Did it get any easier in the meantime?

No, this is not straightforward.

The 'ens1834cds' source at das.ensembl.org serves CDS coordinates on
Ensembl peptides, with contigs as entry points. So,

    http://das.ensembl.org/das/ens1834cds/features?segment=AC105091

will give you things like

    translation ensembl 55008 55087 - + - ProtView

As far as I'm aware, and the Sanger people would be the ones to know
with certainty, we currently have no DAS server serving CDS *sequence*
directly (even though they seem to report "dna/1.0" in the
X-DAS-Capabilities HTTP header).

> 2) I am trying to retrieve genes by LocusLink/HUGO or any other IDs from
> BioJava and get their 5' sequence. Could you point me to some documentation
> that describes this task? Of course, the best would be some "BioJava in
> Anger"-style cookbook-like recipe on the internet, but any kind of keyword
> is appreciated. Yes, there is the DAS client in BioJava, but it does not
> seem to support gene names. Or am I off track here, and DAS is simply not
> meant to support searches like this directly?

First of all, you need a DAS server that understands the IDs
you're trying to use. I'm a bit unsure whether DAS is the
right tool here though. Try something like EnsMart instead
(http://www.ensembl.org/Multi/martview).

For bulk queries, or more complicated stuff, you might want to
look into using the BioMart or Ensembl APIs. DAS could be, I
think, a bit too simple. BioMart is discussed on the mart-dev
list (http://www.ebi.ac.uk/biomart/contact.html), and Ensembl on
the ensembl-dev list (http://www.ensembl.org/Docs/).
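
As a rough illustration of the features request quoted earlier in this
reply, the URL can be assembled from its parts. This is only a sketch:
the helper name das_features_url is hypothetical, and the server,
source and segment values are simply the ones from the example URL
above. It builds the URL and does not contact any server.

```python
from urllib.parse import urlencode


def das_features_url(server, source, segment):
    """Build a DAS 'features' request URL for a single segment.

    Hypothetical helper for illustration; the layout
    <server>/das/<source>/features?segment=<id> mirrors the example
    URL quoted in the message above.
    """
    query = urlencode({"segment": segment})
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


url = das_features_url("http://das.ensembl.org", "ens1834cds", "AC105091")
print(url)
```

The same pattern would presumably apply to other DAS commands by
swapping the final path component, but that is an assumption rather
than something stated in this thread.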

Regards,
Andreas

--
|[][]| Andreas Kähäri       EMBL, European Bioinformatics Institute
| [] |                      Wellcome Trust Genome Campus
|[][]| Ensembl Developer    Hinxton, Cambridgeshire, CB10 1SD
| [] | DAS Team Leader      United Kingdom

From maximilianh at gmx.de Tue Jul 13 07:38:46 2004
From: maximilianh at gmx.de (Maximilian Haeussler)
Date: Tue Jul 13 07:39:16 2004
Subject: [DAS] retrieve genes by name
References: <40EF107A.4010506@gmx.de> <20040712151903.GA10482@ebi.ac.uk>
Message-ID: <40F3C9C6.1050807@gmx.de>

> First of all, you need a DAS server that understands the IDs
> you're trying to use. I'm a bit unsure whether DAS is the
> right tool here though. Try something like EnsMart instead
> (http://www.ensembl.org/Multi/martview).

OK, so I won't use DAS, that's nice to know. I couldn't really figure
that out from the documentation.

> For bulk queries, or more complicated stuff, you might want to
> look into using the BioMart or Ensembl APIs. DAS could be, I
> think, a bit too simple. BioMart is discussed on the mart-dev
> (http://www.ebi.ac.uk/biomart/contact.html) list, and Ensembl on
> the ensembl-dev list (http://www.ensembl.org/Docs/).

Hmm... I'm not sure, but when I use the Ensembl APIs, won't I miss a
couple of model organisms? Arabidopsis, for instance? OK, there is
http://atensembl.arabidopsis.info/, which might also be usable with
the Ensembl APIs. So Ensembl seems to be the most comprehensive way to
go if I want to bulk-download genes of as many organisms as
possible...

Max

From ak at ebi.ac.uk Tue Jul 13 07:55:12 2004
From: ak at ebi.ac.uk (Andreas Kahari)
Date: Tue Jul 13 07:57:09 2004
Subject: [DAS] retrieve genes by name
In-Reply-To: <40F3C9C6.1050807@gmx.de>
References: <40EF107A.4010506@gmx.de> <20040712151903.GA10482@ebi.ac.uk> <40F3C9C6.1050807@gmx.de>
Message-ID: <20040713115512.GA18337@ebi.ac.uk>

On Tue, Jul 13, 2004 at 01:38:46PM +0200, Maximilian Haeussler wrote:
> >First of all, you need a DAS server that understands the IDs
> >you're trying to use.
> >I'm a bit unsure whether DAS is the
> >right tool here though. Try something like EnsMart instead
> >(http://www.ensembl.org/Multi/martview).
>
> OK, so I won't use DAS, that's nice to know. I couldn't really figure that
> out from the documentation.

What documentation would this be?

> >For bulk queries, or more complicated stuff, you might want to
> >look into using the BioMart or Ensembl APIs. DAS could be, I
> >think, a bit too simple. BioMart is discussed on the mart-dev
> >(http://www.ebi.ac.uk/biomart/contact.html) list, and Ensembl on
> >the ensembl-dev list (http://www.ensembl.org/Docs/).
>
> Hmm... I'm not sure, but when I use the Ensembl APIs, won't I miss a couple
> of model organisms? Arabidopsis, for instance? OK, there is
> http://atensembl.arabidopsis.info/ which might also be usable with the
> Ensembl APIs. So Ensembl seems to be the most comprehensive way to go if I
> want to bulk-download genes of as many organisms as possible...

You mentioned HUGO IDs, so I thought you were interested in
human genes only. You never mentioned what sources of data you
had available, so I picked the solution I had closest at hand.

AFAIK, there is no place which has all data for all species in
one single format. Ensembl gets close, but we don't do plants.

Regards,
Andreas

--
|=)(=| Andreas Kähäri       EMBL, European Bioinformatics Institute
|(==)|                      Wellcome Trust Genome Campus
|=)(=| Ensembl Developer    Hinxton, Cambridgeshire, CB10 1SD
|(==)| DAS Team Leader      United Kingdom

From ap3 at sanger.ac.uk Thu Jul 22 08:33:17 2004
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu Jul 22 08:34:55 2004
Subject: [DAS] DAS for protein structures
Message-ID: <40FFB40D.1020508@sanger.ac.uk>

Hi everybody!

I am working together with Thomas Down and Tim Hubbard as part of the
eFamily project to extend the DAS protocol towards protein structures.
During this work we realised that two new DAS command extensions are
required for this:

* structure - requests 3D coordinates
* alignment - requests a pairwise or multiple alignment of protein
  structures, sequences, or chromosomes.

To read more details please access the specification at
http://www.sanger.ac.uk/xml/das/documentation/

Two example requests:
http://das.sanger.ac.uk/das/aligpdbsp/alignment?query=1a4a
http://das.sanger.ac.uk/das/structure/structure?query=1a4a

The extensions allow new clients to be implemented. A prototype of a
client for protein structures can be accessed at:
http://www.sanger.ac.uk/Users/ap3/DAS/SPICE/stable/spice.html

Regards,
Andreas

--
--------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK

From dalke at dalkescientific.com Fri Jul 23 07:20:44 2004
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri Jul 23 07:21:19 2004
Subject: [DAS] DAS for protein structures
In-Reply-To: <40FFB40D.1020508@sanger.ac.uk>
References: <40FFB40D.1020508@sanger.ac.uk>
Message-ID: <56915272-DC9A-11D8-B1D5-000A956826C8@dalkescientific.com>

Hi Andreas,

Some comments on the proposal

> To read more details please access the specification at
> http://www.sanger.ac.uk/xml/das/documentation/

> The SEQRES protein sequences, which is contained in a PDB file, can be
> different to some extent.

Might want to link to the PDB docs for the SEQRES records. You can
also find heterogens (non-ATCG) in the sequence, and it mentions one
of my favorite words in the docs - microheterogeneity.

> There can be negative positions, the order of the numbers
> does not need to be linear, there are alternative locations
> possible (indicated by "A", "B"),

In this case I suspect the "A" and "B" are insertion codes and not
alternate locations. The latter is used when it appears an atom can be
in one of multiple positions, as I recall.
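
To make the insertion-code point concrete: a PDB residue identifier is
a sequence number, which may be negative, optionally followed by a
single-letter insertion code. The sketch below splits such an
identifier into its two parts. The function name split_residue_id and
the sample identifiers are made up for illustration and are not taken
from the specification under review.

```python
import re

# Optional minus sign, digits, then at most one letter (the insertion code).
_RESIDUE_ID = re.compile(r"^\s*(-?\d+)([A-Za-z]?)\s*$")


def split_residue_id(text):
    """Split a residue identifier such as '27A' into (27, 'A').

    Returns an empty string as the insertion code when none is present.
    Hypothetical helper, written only to illustrate the distinction.
    """
    match = _RESIDUE_ID.match(text)
    if match is None:
        raise ValueError(f"not a residue identifier: {text!r}")
    return int(match.group(1)), match.group(2)


for sample in ("27A", "-2", "86B"):
    print(sample, split_residue_id(sample))
```

An alternate location indicator is a separate per-atom field and is
deliberately not handled here.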

> All orientation arguments that are used in various services
> are becoming optional, since orientation is related to the
> orientation along the DNA and is not needed for proteins.

Isn't it still required for nucleotides and ignored for protein?
Otherwise, as you state it, the orientation parameter is also optional
for DNA.

Is "orientation=+" or "orientation=" equivalent to an unspecified
orientation parameter when the sequence is a protein?

> depreciated

"deprecated"

> "is re-established again."

"is re-established." Unless this is the second time it's been
re-established?

> "The ref is argument has "

"The ref argument has "

> It has a version number (required) in the form "N.NN"

Define "N.NN". Does this mean there can be only 1000 versions? Why the
limit? Why not \d+\.\d+ or \d+(\.\d+)? ? Should there be a meaning to
the two parts of the version? Should it always be an increasing value?
Isn't the version information captured elsewhere?

> Whenever the DNA of the entry point changes, the version
> number should change as well.

"Should"? Or "must"?

The entry_points optional attribute "href"
> echoes the URL query that was used to fetch the current document.

I don't understand the need for this. If it's important, it won't work
in some environments because the client's request might be

    http://some.host/x/y/z

where the machine "some.host" forwards the request to another machine
as

    http://another.host/prefix/x/y/z

which does the actual work. The machine "another.host" is on its own
local DNS which isn't visible to the outside world. Since the internal
machine doesn't know the original URL used by the client, it can't
pass back a valid URL.

> For compatibility with older versions of the specification, the
> tag can use a size attribute rather than start and stop,
> and can omit the orientation attribute

Can "size" be used in addition to start/stop as a transition from the
older version to the newer one? If omitted, is the orientation equal
to "+"?
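
Returning to the version-number question raised earlier in this
message, the three readings can be compared side by side. This is a
sketch under stated assumptions: treating "N.NN" literally as one
digit, a dot and two digits is only one possible interpretation of the
spec, and the labels and the accepted_by helper are invented for the
comparison.

```python
import re

# Three candidate grammars for the version attribute.
PATTERNS = {
    "N.NN (literal reading)": re.compile(r"\d\.\d\d"),
    "digits.digits": re.compile(r"\d+\.\d+"),
    "digits with optional minor": re.compile(r"\d+(\.\d+)?"),
}


def accepted_by(version):
    """Return the labels of all patterns that match the whole string."""
    return [
        label
        for label, pattern in PATTERNS.items()
        if pattern.fullmatch(version)
    ]


for candidate in ("1.00", "10.5", "7"):
    print(candidate, accepted_by(candidate))
```

Under the literal reading only the values 0.00 through 9.99 are valid,
which is where the figure of 1000 versions comes from.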

> This query returns one or all alginments

"alignments"

Under the XML you have
> (required; one only)
> The doctype indicates which formal DTD
> specification to use. For the dna query, the doctype DTD is
> "http://www.biodas.org/dtd/dasdna.dtd".

Is that a bad copy&paste from the previous spec?

> subject (optional; one or more) the id of the alignment - subject.
> To get a list of available alignments for query use the entry_points request.

If there is more than one subject, how is the parameter constructed?
Is it comma separated?

> (required) version of Object. e.g. CRC64 checksum for protein sequences.

Why is this version not in the form N.NN? Why is CRC64 suggested?
(md5 is better.) Why only for protein sequences?

> attribute:intObjectId
>
> (required) internal, unique name name for this object. This is used in the
> SEGMENT section to identify to which object an alignment belongs to.

The prefix "int" is confusing. Even "internal" is confusing --
internal to what? What about "sequenceId", since all the objects are
sequences?

attribute:type
> (optional) a type for this object.e.g. DNA, PROTEIN, STRUCTURE, etc.

Who defines "etc."? What about "RNA"? "ssRNA"? "tRNA"? Is the case
important?

The example you give includes

Please move "dbAccessionId" to be with the attributeGroup:dbRef terms,
to make it easier to compare the outline with the documentation.

Could you give snippets from a real example?

> "methodName"

> attribute:dbCoordSys
> (optional). The co-ordinate system used by the database. This
> is not always the same as the database. For example, Pfam uses UniProt ...

How is this specified?

> Clients generally should use the DAS - SEQUENCE request to get the seqeuence,
> so this is optional

If it's optional then why have it here? As defined, all clients must
understand how to get to the DAS - SEQUENCE since they cannot assume
the server supports returning the sequence here.

And btw, it's "sequence" not "seqeuence".

> attribute:property

What are the defined property values for an alignObjectDetail?

Also, fix up the formatting for this example. Also, "CDATA" refers to
unescaped character content, while I think you mean "element content".

> attribute:methodName
> (required) the name of the score, e.g. number of equivlanet
> residues (eqr), e-value, etc.

What about "scoreType"? Do you have an enumerated list? Are all of the
values expected to be a number? If so, is there a restriction on the
range of the number? Are IEEE 754 exceptional values, like NaN or Inf,
allowed?

> Element:

You use the "cigar" string because it provides an "efficient way to
encode an alignment", but then you don't provide an efficient way to
encode the rotation matrix. Two possibilities are:
  - it's orthonormal so only include the upper/lower triangle
  - use comma separated values

You don't say if the vector transformation occurs before or after the
rotation matrix. Nor do you say which structure gets the
transformation, since it only states:

  this section defines how one of the needs to be shifted and
  rotated in order to be superimposed with the others.

Couldn't you just write this as a (perhaps flattened) homogeneous
transformation matrix, simplified because you know it's only going to
be used for rigid body transformations? The result would look like:

  r11,r12,r13,r22,r23,r33,t1,t2,t3

and be much more succinct than what you have now.

Under "Retrieve 3D coordinates": if the chain is not given, is it
assumed to be equivalent to the chain " "? All PDB residues have a
chain, and space is allowed for a chain id. Or does unspecified chain
mean get the first chain?

Since "one or more" chain ids are allowed, how are they given? Comma
separated values?

Where do I find the number of models in the structure? According to
the docs it implies it can be found from entry_points ("The same
applies to a structure server where entry_points returns the list of
available chains and models."). I don't see that field described.
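
Coming back to the flattened transformation suggested above, here is a
minimal sketch of how a client might read one. Several things in it
are assumptions rather than anything the spec or this message defines:
the string carries all nine rotation entries in row-major order
followed by the three translation components (twelve values, not the
nine-value short form), the rotation is applied before the
translation, and the names parse_rigid_transform and apply_transform
are hypothetical. All nine rotation entries are kept because a
rotation matrix is in general not symmetric, so this sketch does not
try to rebuild it from one triangle.

```python
def parse_rigid_transform(text):
    """Parse 'r11,...,r33,t1,t2,t3' into (rotation rows, translation)."""
    values = [float(v) for v in text.split(",")]
    if len(values) != 12:
        raise ValueError(f"expected 12 values, got {len(values)}")
    rotation = [values[0:3], values[3:6], values[6:9]]
    translation = values[9:12]
    return rotation, translation


def apply_transform(rotation, translation, point):
    """Rotate the point first, then add the translation (assumed order)."""
    return [
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    ]


# A 90 degree rotation about the z axis followed by a shift of (1, 2, 3).
rot, trans = parse_rigid_transform("0,-1,0,1,0,0,0,0,1,1,2,3")
print(apply_transform(rot, trans, [1.0, 0.0, 0.0]))
```

Whichever order and layout the spec settles on, stating it explicitly
would answer the "before or after" question raised above.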

How do you support the alternate location identifier? Just ignore it?
Return all locations for a given atom?

Why do you define your own XML format for 3D structure? What about
basing it on, say, CML? Or why not just feed a PDB file back, perhaps
embedded inside of XML? After all, no structure program is going to
handle your XML format.

If you do want to roll your own, there are many things to fix. Here
are several:

> attribute:groupID
>
> (required) the PDB code of the amino acid. e.g. 25,26,27A
>
> attribute:insertCode
>
> (optional) insertion code for amino acid. e.g 86A, 86B

Okay, which is the group ID and which is the insertion code? The first
should be a number (-2, 0, 26) and the insertion code is a character.

>
>
> Two atoms make a connection.

Where's the other atomID? Also, in some places you have "Id" (as in
"dbAccessionId") and in others you have "ID".

Are only covalent bonds important? What about HYDBND records?

You also ignore the anisotropic B-factors and other bits of data which
may be in the PDB file. For example, waters on the symmetry axis of a
crystal structure may be denoted by an occupancy value of 1/symmetry
count. (See the comments for 2PLV.)

And you're missing the crystal information.

It's 5am here, so my apologies if any of the above sounds overly terse
or confusing.

Andrew
dalke@dalkescientific.com

From ap3 at sanger.ac.uk Sun Jul 25 13:49:47 2004
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun Jul 25 13:58:34 2004
Subject: [DAS] DAS for protein structures
Message-ID: <200407251849.47774.ap3@sanger.ac.uk>

Hi Andrew!

Thanks for your detailed feedback. Let me go through the most
important issues of your mail:

> Why do you define your own XML format for 3D structure? What about
> basing it on, say, CML? Or why not just feed a PDB file back, perhaps
> embedded inside of XML?

* DAS responses consist of XML files that provide a simple format to
exchange data.
PDB files contain different types of data: biological data about the
protein, literature refs, description of the experiment and finally
the coordinates. So I would not want to mix DAS-XML and (traditional)
PDB files.

As you mentioned there are several XML formats for the replacement of
PDB files. It does not make sense to invent yet another one to deal
with *all* the PDB data. Here the idea is to reduce the PDB file to
the minimal data needed for visualization, i.e. coordinates of atoms
and their connections. The biological data that is projected onto the
3D structure by a client is retrieved via DAS - Feature and Alignment
services.

> After all, no structure program is going to
> handle your XML format.

I guess no structure program is capable of doing ANY - DAS
communication at the moment. That's what we try to provide - missing
services to apply DAS in the structure world. If you are developing a
Java program (I know you are a Python guy, but still ;-) , making it
DAS enabled is quite simple. There is support for the new DAS commands
in Biojava. e.g.:

To get a Biojava structure object via DAS

    String server = "http://das.sanger.ac.uk/das/structure/structure?query=";
    DASStructureClient dasc = new DASStructureClient(server);
    Structure struc = dasc.getStructure(pdbcode);

> You use the "cigar" string because it provides an "efficient way to
> encode an alignment" but then you don't provide an efficient way to
> encode the rotation matrix.

Yes, but the matrix does not take much space, so it is not really an
issue. An alignment in contrast can be quite big, so the cigar
encoding saves a lot of space.

> Why is CRC64 suggested? (md5 is better.)

This is the checksum provided by Swissprot.

> The entry_points optional attribute "href"
>> echoes the URL query that was used to fetch the current document.
>I don't understand the need for this.

same here. It is in the DAS spec. so I kept it.
There are a couple of issues with entry_points and proteins anyways.
E.g.
Swissprot has >150.000 "entry points" ;-)

I will address several of your other issues by improving the
documentation over the next few days.

Regards,
Andreas

--
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK

From dalke at dalkescientific.com Sun Jul 25 16:07:10 2004
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sun Jul 25 16:07:33 2004
Subject: [DAS] DAS for protein structures
In-Reply-To: <200407251849.47774.ap3@sanger.ac.uk>
References: <200407251849.47774.ap3@sanger.ac.uk>
Message-ID: <363A4690-DE76-11D8-B1D5-000A956826C8@dalkescientific.com>

Andreas:
> Here the idea is to reduce the PDB file to the minimal data
> needed for visualization, i.e. coordinates of atoms and their
> connections.
> The biological data that is projected onto the 3D structure by a
> client is
> retrieved via DAS - Feature and Alignment services.

What is "the minimal data needed for visualization"? The most terse
file format I know is the XYZ format, which has X, Y, Z coordinates
and element type. Everything else about the structure can be derived
from that, either through quantum mechanics or through empirical
methods.

Humans want more than that, like residue name, chain id, and segment
name (I don't think your spec had the last). Some people want to see
how the structure fits in the crystal, e.g., to see if a given feature
is more an aspect of crystal packing forces. Some want the secondary
structure annotation information (HELIX and SHEET) while others are
just fine with automated means.

By saying you're only going to support a subset of what's in the PDB,
you're saying that those other portions aren't important enough. But
they are, or could be, for some people and some structures.

>> After all, no structure program is going to
>> handle your XML format.
>
> I guess no structure program is capable of doing ANY - DAS
> communication at
> the moment. That's what we try to provide - missing services to apply
> DAS in
> the structure world.
> If you are developing a Java program (I know you
> are a
> Python guy, but still ;-) , making it DAS enabled is quite simple.
> There is
> support for the new DAS commands in Biojava. e.g.:

But it's a lot easier to get an existing Java structure visualization
library to support a PDB file than to support your new format, or your
Biojava structure object. For example, suppose I want to use Jmol or
Marvin as my viewer -- how hard would that be using your API?

I see the Biojava structure object supports reading the PDB format,
but it doesn't capture all of the data, so going through it to read
the DAS result and then generate a PDB formatted string to pass to
another library will cause some data loss.

There are many sources of data loss. For example, I see you support
the x-ray resolution field, but it turns out that the documentation
isn't correct. It isn't a simple float because a resolution of "1.20"
is different than one of "1.2". There are a few other places like
that. And you don't support PDB version 1 files, nor extensions like
XPLOR's serial numbering extension, where the first digit can roll
over to A (as in 99999, A0000, ...) to support more than 99999 atoms.

> To get a Biojava structure object via DAS
>
> String server =
> "http://das.sanger.ac.uk/das/structure/structure?query=";
> DASStructureClient dasc = new DASStructureClient(server);
> Structure struc = dasc.getStructure(pdbcode);

Suppose you instead returned

  HEADER IMMUNOGLOBULIN 16-JAN-92 XXXX
  TITLE 2.9 ANGSTROMS RESOLUTION STRUCTURE OF AN ANTI-DINITROPHENYL-
  TITLE 2 SPIN-LABEL MONOCLONAL ANTIBODY FAB FRAGMENT WITH BOUND
  TITLE 3 HAPTEN
  ...
  ATOM ...
  END

The API wouldn't change at all. The implementation would, but not the
API.

Or suppose you instead used a more ReST-ful format which returns

Then that href lookup could be cached, or translated into a local
fetch, or pointed to RCSB's PDB server. It could also support things
like content negotiation to return a PDB vs. CML vs.
other file format, at the desire of the client. (Though con-neg is
still more a hope of mine than something actually used.)

In any case, the API would be identical to what you propose. The
format is just that, a format. There must be something to convert it
to a Biojava API, whether that format be this new XML one, PDB or
mmCIF. Your API hides the conversion layer, so it's invisible to the
application code no matter the format.

>> You use the "cigar" string because it provides an "efficient way to
>> encode an alignment" but then you don't provide an efficient way to
>> encode the rotation matrix.
>
> Yes, but the matrix does not take much space, so it is not really an
> issue. An
> alignment in contrast can be quite big, so the cigar encoding saves a
> lot of
> space.

Then don't even worry about it as a space issue. Just give the 4x4
homogeneous transformation matrix. Anyone doing structure work should
have libraries for handling coordinate transforms like this, and it's
much more elegant than having several different element types (for
both the matrix and vector).

I'll still argue that you should use a format like

  m11,m12,m13,m14,m21,m22,m23,m24,m31,m32,m33,m34,m41,m42,m43,m44

rather than

It's just so much easier for implementers to read a single vector of
numbers into a 4x4 matrix than to read your format.

What is your criterion for determining the space vs. implementation
costs overhead? Why wouldn't

be even more concise and readable?

Another option is to consider how the SVG spec handles the same
problem, though it is in 2D instead of 3D. Here are a few examples I
found:

The last is the closest to what I'm proposing. (The earlier ones are
harder because the rotation can be around different axes.) That
suggests an even nicer encoding as

(or use the full 4x4 matrix). Terse, concise, easy to support. What's
not to like about it?

>> Why is CRC64 suggested? (md5 is better.)
>
> This is the checksum provided by Swissprot.

But why is it suggested?
Why not just leave it as

> attribute:objectVersion
>
> (required) version of Object

and don't make any recommendation for how to construct the checksum.
Better would be to make some functional description of the version,
like "must change when the sequence changes" for the weak version you
have, or "must be a positive integer which increments when the
sequence changes" for a strict version.

BTW, as written the objectVersion can be identical to the protein
sequence itself. Is there a limit to the size of the version string?

The SWISS-PROT record also keeps the timestamp for the last change of
the protein sequence. What about using that field instead? Not that I
want to mandate that one, but I offer it as another value which meets
your spec, and seems more appropriate.

Do you know about the Chemistry Development Kit
(http://sourceforge.net/projects/cdk/ ) or Joelib
(http://www-ra.informatik.uni-tuebingen.de/software/joelib/index.html )?
They are two other open-source chemistry libraries for Java and may
contain code or techniques you all can draw from.

Andrew
dalke@dalkescientific.com

From ap3 at sanger.ac.uk Mon Jul 26 08:06:37 2004
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon Jul 26 08:08:09 2004
Subject: [DAS] DAS for protein structures
In-Reply-To: <363A4690-DE76-11D8-B1D5-000A956826C8@dalkescientific.com>
References: <200407251849.47774.ap3@sanger.ac.uk> <363A4690-DE76-11D8-B1D5-000A956826C8@dalkescientific.com>
Message-ID: <4104F3CD.3090802@sanger.ac.uk>

Andrew:

In DAS there are Annotation servers and Reference servers. The
structure service is just another type of Reference server. It only
needs to serve coordinates. All other data should be provided using
other DAS services. This way already existing clients, that cannot do
3D can use the same services and represent data in a 2D way.

I do not think that continuing this discussion per email is going to
lead anywhere, but I understand you are going to ISMB, so if you or
anybody else on this list is interested, we can have a meeting there.

Regards,
Andreas

--
--------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK