From alan.mcculloch at agresearch.co.nz Mon Jun 2 16:56:25 2003
From: alan.mcculloch at agresearch.co.nz (McCulloch, Alan)
Date: Sun Jun 1 23:56:34 2003
Subject: [DAS] "XGLT" XML based bioinformatics functional programming language(s) proposal
Message-ID:

I'm posting this naive proposal for an XML based functional-programming style of bioinformatics language, or collection of languages, to the main open-bio lists I am familiar with, to try to find out if anybody else is interested in thinking about a non-naive proposal, or knows somebody who might be, or is already doing so.

For very tentative examples of the sort of thing I have in mind, see Examples 1 and 2 below. (For one overview of functional programming languages and XML see for example http://www.xml.com/pub/a/2001/02/14/functional.html)

In what follows an XML based functional-programming style of bioinformatics language is referred to as an XGLT (i.e. XSLT-with-a-G, "Genetic Transformation Language", for want of a better term, though it's not really related to Genetics specifically, so the G is moot).

The main ideas initially are that such a language would

* provide a high-level implementation-independent interface to the rich Object Oriented (O-O) libraries (BioJava, BioPerl, BioPython and others), more accessible to non-experts, and to developers working in other environments. XGLT interpreters could be developed using these libraries.

* provide an alternative "constructive" way of representing biological sequence and other data. An XGLT based data packet would in general express how to (re)construct a given piece of biological sequence data - e.g. a sequence, or a consensus alignment of sequences, or a translation - rather than convey the data itself, or any particular model of the data. While initially limited to sequence data, it is possible such a functional programming dual may find application to other biological data.
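The "constructive" idea in the second bullet can be sketched in a few lines of Python. This is only an illustration, not a proposed XGLT syntax or engine: the class names (Literal, Insert, Delete) and the evaluate function are invented here. The two strings are the Contig1 and read1 sequences from Example 1 later in this message, and treating read1 as "Contig1 plus one inserted base" is an assumption about that alignment.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    """A starting sequence given directly (hypothetical name)."""
    seq: str


@dataclass(frozen=True)
class Insert:
    """Insert `bases` before 0-based position `pos` of the source's output."""
    source: "Transform"
    pos: int
    bases: str


@dataclass(frozen=True)
class Delete:
    """Delete `length` bases starting at 0-based position `pos`."""
    source: "Transform"
    pos: int
    length: int


Transform = Union[Literal, Insert, Delete]


def evaluate(t: Transform) -> str:
    """Reconstruct the sequence by evaluating the nested transforms."""
    if isinstance(t, Literal):
        return t.seq
    base = evaluate(t.source)  # each transform first evaluates its source
    if isinstance(t, Insert):
        return base[:t.pos] + t.bases + base[t.pos:]
    if isinstance(t, Delete):
        return base[:t.pos] + base[t.pos + t.length:]
    raise TypeError(f"unknown transform: {t!r}")


# The "data packet" says how to build read1, rather than carrying read1 itself.
contig = Literal("CGATCGAGCGTG")
read1 = Insert(source=contig, pos=5, bases="C")

print(evaluate(read1))  # CGATCCGAGCGTG
```

The point of the sketch is that the packet for read1 carries the edit (one inserted C) explicitly, which is the information that is lost when only the final string is exchanged.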
Such languages would have the following benefits:

1) They would enable reference to and exchange of large complex data structures, such as alignments, in a succinct form that is very suitable for further manipulation. (Example 1 below)

2) Because such languages would in most cases exchange statements about how to (re)construct data, rather than the data itself, they would convey valuable information that is lost when only the end results are transmitted - as an example, any indels made in a DNA sequence read as part of its protein translation. (Example 2 below)

3) Such languages could potentially provide a convenient, higher-level, more declarative style of functional programming interface to Object Oriented libraries such as BioJava, BioPerl, BioPython and others, as these O-O libraries could be used to write the XGLT engines required to actually interpret and execute XGLT statements.

4) A functional programming style lends itself more readily to expressing a chain of processing steps, i.e. a (mini-) pipeline, than does an Object Oriented system, which is more expressive of static structure. See Example 2 below for a very simple/naive example of a micro-translation-pipeline expressed as a nested series of transforms in an XGLT.

5) This point is related to both point 3) and point 4) above. It is likely that one popular method of making bioinformatics software libraries such as the Bio* projects accessible to the non-expert and/or non-Java/Perl/Python user will be to build Web Services directories (WSDL), with each service mapping to a static Bio* facade method that internally creates temporary Bio* objects to execute the service method. However this approach is really limited to one-shot services. Where a task calls for a series of services to be invoked in a pipeline, the fact that the underlying Bio* objects do not persist between calls is a problem, which would require expensive marshalling of output and input between web-service calls.
The combination of an XGLT language allowing a non-expert user to specify a nested series of processing steps in a high-level implementation-independent manner, with an XGLT interpreter/engine written using one or more of the well engineered, rich O-O-based Bio* libraries, would potentially allow the entire pipeline to be executed within the O-O based engine, with objects persisting as and when required for the entire pipeline process.

6) An advantage of making a functional-programming representation XML based is that in many cases the representation would not need to be interpreted by a real XGLT interpreter to be useful. For example it is easy to use XSLT to transform Example 1 below into something like an SVG (http://www.w3.org/TR/SVG/) based display of the patterns of variation in an alignment, without actually executing the various editing steps required to construct the reads. An XGLT dual of a protein reference sequence, as in Example 2, includes enough information to plot a rich feature track on a genome viewer, without actually executing the translation.

Finally, it would be desirable to provide some sort of theoretical context for the suggestions and examples presented here, and so I give a very tentative one.

Comparing the two representations of an alignment of sequences in Example 1, both contain the same information, but one (the XGLT version) is projected into a space of functions, and the other into a geometric space. This is analogous to the duality between the time-domain and frequency-domain representations of a mathematical function or data series. (Another analogy is with the duality between a vector space and the dual space of linear functionals defined on that space.) Others have pointed out a duality relationship between Object Oriented and Functional Programming languages. So the tentative theoretical context is that expressions in XGLT languages would amount to almost formal duals of the original data and models.
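Point 6 above (getting something useful out of the representation without running an interpreter) can be sketched as follows. This is a rough illustration only: every element and attribute name in the XML string is hypothetical and is not taken from the proposal, and list_edits is an invented helper. It uses Python's standard xml.etree.ElementTree module to read where the edits are, without applying any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: describes read1 as "Contig1 with one inserted base".
DOC = """
<transform id="read1">
  <source ref="Contig1"/>
  <insert pos="5" bases="C"/>
</transform>
"""

# Edit element names this sketch knows about (all hypothetical).
EDIT_TAGS = ("insert", "delete", "substitute")


def list_edits(xml_text):
    """Return (edit kind, position) pairs without executing any edit."""
    root = ET.fromstring(xml_text.strip())
    edits = []
    for node in root:
        if node.tag in EDIT_TAGS:
            edits.append((node.tag, int(node.get("pos"))))
    return edits


# Enough to mark variation on a display; no sequence was ever built.
print(list_edits(DOC))  # [('insert', 5)]
```

A viewer could draw a mark at each returned position, which is the kind of use that needs the description of the edits but not their result.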
Therefore I would suggest the XGLT representation of something like an alignment (Example 1) or protein translation (Example 2) be referred to as the "XGLT dual" or "functional programming dual" of the original, to emphasize that we are really dealing with the same information, but projected into a different space - one of functions. And just as working in the frequency domain can sometimes be a productive thing to do with a mathematical function or data series, so working in a dual functional-programming domain as suggested here may be productive for some purposes.

I'd be grateful for any feedback (however harsh!) on my admittedly very naive proposal.

Cheers

Alan McCulloch

---------------------------------------------------------------------
Example 1
---------------------------------------------------------------------

Set out below is a possible XGLT dual of the following alignment fragment:

>Contig1 CGATCGAGCGTG
read1 CGATCCGAGCGTG
read2 GATC-GAGCGTG
read3 GACC-AGGGTT
read4 GACC-GAGCGT
read5 ATC-GA
-------------
CGATC-GAGCGTG
CGATCGAGCGTG

---------------------------------------------------------------------
Example 2
---------------------------------------------------------------------

---------------------------------------------------------------------
Comment on Above Examples
---------------------------------------------------------------------

In these examples I have...

1) ...tried to suggest a functional style of programming, but an actual XGLT may look quite different. Transformations are declared and referenced inside other transformations, in a nested structure.
Each transform stands alone, in that it first calls another transform that provides its starting point (and this transform may in turn involve a call to another transform, etc).

2) ...tried to demonstrate how an XGLT would convey valuable information about (in this example) the way the RefSeq was made, not just the sequence of the RefSeq itself. We not only achieve a succinct and in this case compressed expression of the actual sequence of the RefSeq, we also have an audit trail of how the RefSeq was curated.

3) ...supposed that rather than a single xglt language/namespace, there would be a collection of namespaces such as

xglt: basic language for expressing things in a functional programming manner - defining and referencing transforms etc.

xbiopath: functions for referencing and extracting biological sequences from databases and genomes. The example given in (1) is a simple coordinate based extract, but one could also envisage specifying things like similarity based paths.... - this would result in the extraction of 2.5Kb sections of sequence, from all positions 2Kb upstream of any hg15 hits to the RefSeq that was constructed in example 1.

xprotein: functions for working with protein primary and secondary structure.

xseqedit: basic functions for sequence editing. This example shows indels and changes - one can also envisage, say, masking and quality trimming functions that could be specified in a transform, as part of a pipeline.

4) ...noted that one would also want to be able to use XPath-ish (http://www.w3.org/TR/xpath) references to other parts of the current or other XGLT documents.

=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================

From ggolofit at wt.com.pl Mon Jun 2 11:33:59 2003
From: ggolofit at wt.com.pl (Grzegorz Golofit)
Date: Mon Jun 2 16:01:34 2003
Subject: [DAS] mysqld running problem
Message-ID: <000801c328e1$ba443250$f519a8c0@Market6>

I installed the mysql rpm packages included with the RedHat 7.1 Seawolf edition. Installation caused no problem, but when I tried to start mysqld with

mysqld start

as root I got the following information: Could not find mysqld command. Could anybody tell me what I did wrong?

From davidl at ebi.ac.uk Mon Jun 2 10:45:54 2003
From: davidl at ebi.ac.uk (davidl@ebi.ac.uk)
Date: Tue Jun 3 07:39:30 2003
Subject: [DAS] Tax tree
Message-ID: <2031.217.36.113.189.1054543554.squirrel@webmail.ebi.ac.uk>

Hi Ujjwal,

I decided to work at home today as it looked a bit wet to drive South. Being sort of forgetful, could you email me the last EXCEL file I sent you please, as I don't have it here. As I want to go over the Eukaryotic lineages...

I attach a file (WORD) which is just another representation and some notes for the non-eukaryotes. It differs slightly from what I sent before, but only in the groups that are unlikely to have an InterPro signature, with the possible exception of Prions - but I have put these in with 'Other' for the time being.

Hope all Ok

David

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Tax1.doc
Type: application/msword
Size: 27648 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/das/attachments/20030602/da90a6fe/Tax1-0001.doc