From g38909015 at mailsrv.ym.edu.tw  Wed Aug  1 01:06:37 2001
From: g38909015 at mailsrv.ym.edu.tw (TerryYeh-YM)
Date: Wed, 1 Aug 2001 13:06:37 +0800
Subject: Join mailing list
Message-ID: <000a01c11a47$be86f3c0$46146e8c@nchc.gov.tw>


------------------------------------------------------ 
Chang-Wei Yeh (Terry Yeh) 
National Yang Ming University 
College of Life Science 
Institute of Anatomy and Cell Biology 
Bioinformatics Program and Core Lab 
------------------------------------------------------



From simon.andrews at bbsrc.ac.uk  Wed Aug  1 08:56:08 2001
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Wed, 1 Aug 2001 13:56:08 +0100
Subject: [EMBOSS] Getting headers from Seqret
Message-ID: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk>

[sent to Emboss mailing list]

Dear All,

I'm having trouble getting header information back through seqret, from a
database formatted using dbiflat against a genbank flat file (refseq
actually).  I'm sure plenty of people must have done this before, but I've
read through the documentation, and I can't see where I'm going wrong!

The database formatted OK, and I can fetch sequences back from it, but at
some point I will need to retrieve the entire header from the original file
to get at some of the extra information in there (feature tables, cross
references, authors etc).  I've tried several different output USAs with
seqret, but the most I can seem to get back is the name, accession number
and description.

I can't believe that this information is thrown away by seqret (it's still
there in the flat file after all), so how can I retrieve it?

	Thanks for any help

	Simon

[Potentially useful details follow]

----
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463 


##########################################################################


	Emboss version = 2.0.0

	Platform = DEC alpha (OSF1 v4.0)


My emboss.default entry for the database looks like;

	DB refseq [
	        type: N
	        method: emblcd
	        format: gb
	        dir: /usr/users/andrewss/Refseq/Genbank
	        file: "*.gbff"
	        release: "1.0"
	        comment: "Refseq Hum Mus Rat"
	]

and an example of the output of seqret with a debug USA is (with the
documentation space suspiciously blank!);

Sequence output trace
=====================

  Name: 'NM_031360'
  Accession: 'NM_031360'
  Description: 'Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.'
  Type: 'N'
  Database: 'refseq'
  Full name: ''
  Date: ''
  Usa: 'debug::test.seq'
  Ufo: ''
  Input format: 'gb'
  Output format: 'debug'
  Filename: 'test.seq'
  Entryname: 'NM_031360'
  File name: 'test.seq'
  Extension: 'fasta'
  Single: 'No'
  Features: 'No'
  Count: 'No'
  Documentation:...

    1  atgaagcaca acttttctct gcggctgagg gttttcaacc tcaactgctg    50
   51  ggacatcccc tacctaagca agcatagggc cgaccgcatg aagcgcttgg   100 

       etc.


The extra stuff I'm after is this sort of thing;

LOCUS       NM_031360    1269 bp    mRNA            ROD       12-JUN-2001
DEFINITION  Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.
ACCESSION   NM_031360
VERSION     NM_031360.1  GI:14389300
KEYWORDS    .
SOURCE      Norway rat.
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
            Rattus.
REFERENCE   1  (sites)
  AUTHORS   Mizutani,Y., Tamiya-Koizumi,K., Irie,F., Hirabayashi,Y., Miwa,M.
            and Yoshida,S.
  TITLE     Cloning and expression of rat neutral sphingomyelinase:
            enzymological characterization and identification of essential
            histidine residues
  JOURNAL   Biochim. Biophys. Acta 1485 (2-3), 236-246 (2000)
  MEDLINE   20292884
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AB047002.1.
FEATURES             Location/Qualifiers
     source          1..1269
                     /organism="Rattus norvegicus"
                     /strain="Sprague-Dawley"
                     /db_xref="taxon:10116"
                     /chromosome="X"
                     /chromosome="14"
                     /chromosome="2"
                     /chromosome="3"
                     /chromosome="17"
                     /map="Xq28"
                     /map="14q"
                     /map="2 36.0 cM"
                     /map="Xq11.1"
                     /map="3"
                     /map="17q12-q21"
                     /sex="male"
                     /tissue_type="liver"
                     /clone_lib="rat liver lambda cDNA library
                     (STRATAGENE,#936513)"
     gene            1..1269
                     /gene="Smpd2"
                     /note="EBS3; EBS4; K14; CK; MAGE5; MAGE10; Tdo; Araf"
                     /db_xref="LocusID:83537"
                     /db_xref="MGD:MGI:98246"
                     /db_xref="MIM:148066"
                     /db_xref="MIM:300340"
                     /db_xref="MIM:300343"
                     /db_xref="MIM:601443"
                     /db_xref="RATMAP:36372"
                     /db_xref="RGD:36372"
     CDS             1..1269
                     /gene="Smpd2"
                     /note="lyso-platelet activating factor-phospholipase C;
                     cytokeratin 14; Raf related protein;
                     Synaptosomal-associated protein"
                     /codon_start=1
                     /db_xref="LocusID:83537"
                     /db_xref="MGD:MGI:98246"
                     /db_xref="MIM:148066"
                     /db_xref="MIM:300340"
                     /db_xref="MIM:300343"
                     /db_xref="MIM:601443"
                     /db_xref="RATMAP:36372"
                     /db_xref="RGD:36372"
                     /product="neutral sphingomyelinase"
                     /protein_id="NP_112650.1"
                     /db_xref="GI:14389301"
                     /translation="MKHNFSLRLRVFNLNCWDIPYLSKHRADRMKRLGDFLNLESFDL
                     ALLEEVWSEQDFQYLKQKLSLTYPDAHYFRSGIIGSGLCVFSRHPIQEIVQHVYTLNG
                     YPYKFYHGDWFCGKAVGLLVLHLSGLVLNAYVTHLHAEYSRQKDIYFAHRVAQAWELA
                     QFIHHTSKKANVVLLCGDLNMHPKDLGCCLLKEWTGLRDAFVETEDFKGSEDGCTMVP
                     KNCYVSQQDLGPFPFGVRIDYVLYKAVSGFHICCKTLKTTTGCDPHNGTPFSDHEALM
                     ATLCVKHSPPQEDPCSAHGSAERSALISALREARTELGRGIAQARWWAALFGYVMILG
                     LSLLVLLCVLAAGEEAREVAIMLWTPSVGLVLGAGAVYLFHKQEAKSLCRAQAEIQHV
                     LTRTTETQDLGSEPHPTHCRQQEADRAEEK"
     misc_feature    91..837
                     /note="AP_endonucleas1; Region: AP endonuclease family
                     1"


From peter.rice at uk.lionbioscience.com  Wed Aug  1 09:12:57 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Wed, 01 Aug 2001 14:12:57 +0100
Subject: [EMBOSS] Getting headers from Seqret
References: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk>
Message-ID: <3B680059.74B4594@uk.lionbioscience.com>

"simon andrews (BI)" wrote:
> The database formatted OK, and I can fetch sequences back from it, but
> at some point I will need to retrieve the entire header from the
> original file to get at some of the extra information in there
> (feature tables, cross references, authors etc).
>
> I've tried several different output USAs with
> seqret, but the most I can seem to get back is the name, accession number
> and description.

It all depends on how much information we store in the internal data
structures. As standard, we keep the ID, Accession, Description and
sequence so we can write a FASTA format file easily.

We also keep the complete feature table, but only optionally. seqret
ignores it, but seqretallfeat reads and writes it. Most programs only need
the sequence data and parsing feature information wastes time and space on
large sequences.

We can also read the entire text of an entry with entret, assuming you want
the original flatfile format.

>I can't believe that this information is thrown away by seqret
> (it's still there in the flat file after all),

Yes, it is (but we can easily read more fields - the problem is whether we
can convert them to other file formats easily)

> so how can I retrieve it?

Using entret - which sounds like the solution you need.
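
Once the complete entry text has been fetched (for example the text that entret writes out), the header can be separated from the sequence in a few lines. The following is only a minimal sketch, assuming GenBank flat-file layout, where the sequence starts at the ORIGIN line and continuation lines are indented by 12 spaces. The function names `split_genbank_entry` and `header_field` are made up for illustration and are not part of EMBOSS.

```python
def split_genbank_entry(entry_text):
    """Split one GenBank flat-file entry into (header, sequence_block).

    The header is everything before the ORIGIN line; the sequence block
    is the ORIGIN line and everything after it.
    """
    header_lines = []
    sequence_lines = []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
        if in_sequence:
            sequence_lines.append(line)
        else:
            header_lines.append(line)
    return "\n".join(header_lines), "\n".join(sequence_lines)


def header_field(header, keyword):
    """Return the text of a top-level field such as DEFINITION or ACCESSION."""
    collected = []
    capturing = False
    for line in header.splitlines():
        if line.startswith(keyword):
            capturing = True
            collected.append(line[len(keyword):].strip())
        elif capturing and line.startswith(" " * 12):
            collected.append(line.strip())  # continuation of the same field
        elif capturing:
            break  # next keyword or sub-keyword reached
    return " ".join(collected)
```

Fields with sub-keywords (REFERENCE, SOURCE) and the feature table would need their own handling; this sketch only covers simple top-level fields.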

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From ableasby at hgmp.mrc.ac.uk  Wed Aug  1 13:15:25 2001
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Wed, 1 Aug 2001 18:15:25 +0100 (BST)
Subject: EMBOSS patchfiles directory
Message-ID: <200108011715.SAA26106@bromine.hgmp.mrc.ac.uk>

Just a reminder that, between EMBOSS releases, occasional bugfixes
are placed in the directory:

  ftp://ftp.uk.embnet.org/pub/EMBOSS/patchfiles/

There are currently two replacement files in that directory.

    marscan.c
    showfeat.c

Both are replacements for applications in the EMBOSS-2.0.1/emboss
directory.

Alan


From gbottu at ben.vub.ac.be  Thu Aug  2 13:00:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 2 Aug 2001 19:00:02 +0200 (MET DST)
Subject: databanks in PIR format
Message-ID: <200108021700.TAA24275@bigben.vub.ac.be>

from : BEN

	Dear colleagues,
	
Has anybody already successfully accessed databanks in PIR NBRF or CODATA format 
under EMBOSS ? 

I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is 
indexed under SRS). My emboss.default file contains :

DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant'
            methodquery: srs dbalias: PIR_NR
            methodall: direct dir: /seq/protein/flat file: pir_nr.seq
]

But this does not work. E.g.  seqret pir_nr:e69549  gives an output file :

>E69549 conserved hypothetical protein AF2396 - Archaeoglobus fulgidus
>E69549
MTVVPLSALREGQEGRVVAINGGRGCTARLMSMGIVPGKKIRIAGRRGGAVLVSVNGTKF
VIGRGLAMKVAVDVGEQG

	Guy Bottu


From peter.rice at uk.lionbioscience.com  Thu Aug  2 13:28:29 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Thu, 02 Aug 2001 18:28:29 +0100
Subject: databanks in PIR format
References: <200108021700.TAA24275@bigben.vub.ac.be>
Message-ID: <3B698DBD.9984ED23@uk.lionbioscience.com>

Guy Bottu wrote:
> I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is
> indexed under SRS). My emboss.default file contains :
> 
> DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant'
>             methodquery: srs dbalias: PIR_NR
>             methodall: direct dir: /seq/protein/flat file: pir_nr.seq
> ]
> 
> But this does not work. E.g.  seqret pir_nr:e69549  gives an output file 

This is because of problems in SRS converting PIR entries to PIR format.
This has been the same since the days of SRS 5, but I have passed it on to
the support guys here to take a look. Seems nobody has been retrieving PIR
entries in their original format.

For example, see PIR on the SRS 5 server at MIPS:

http://srs-mips.gsf.de/srs5bin/cgi-bin/wgetz?-id+2trYB1GreRI+-e+[PIR-ID:'E69549']

You can get queries to work with:

DB pir_nr [ type: P format: fasta comment: 'PIR nonredundant'
            methodquery: srsfasta dbalias: PIR_NR
            methodall: direct dir: /seq/protein/flat file: pir_nr.seq
]

... but the fasta format required for srsfasta will not let you work with
direct access to all entries.

srs access does getz -e

srsfasta access does getz -d -sf fasta

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From peter.rice at uk.lionbioscience.com  Thu Aug  2 13:59:20 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Thu, 02 Aug 2001 18:59:20 +0100
Subject: databanks in PIR format
References: <200108021700.TAA24275@bigben.vub.ac.be> <3B698DBD.9984ED23@uk.lionbioscience.com>
Message-ID: <3B6994F8.F4A5A403@uk.lionbioscience.com>

>This is because of problems in SRS converting PIR entries to PIR format.
>This has been the same since the days of SRS 5, but I have passed it on to
>the support guys here to take a look.

Quick fix would be to change the format in pir.i to be "plain" and run
srssection.

This gives PIR format without the trailing * but is good enough to make
EMBOSS happy. Then Guy's original definition should work.

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From gbottu at ben.vub.ac.be  Fri Aug  3 05:43:57 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 11:43:57 +0200 (MET DST)
Subject: databanks in PIR format
Message-ID: <200108030943.LAA11981@bigben.vub.ac.be>


>Quick fix would be to change the format in pir.i to be "plain" and run
>srssection.
>
>This gives PIR format without the trailing * but is good enough to make
>EMBOSS happy. Then Guy's original definition should work.
>

I tried and it worked ! Thanks for the advice.

Still, there must be some nasty bug hidden in the SRS code, since a similar
problem does not occur with EMBL and GenBank formats. Let's hope they can fix
it.

	Guy Bottu


From gbottu at ben.vub.ac.be  Fri Aug  3 08:48:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 14:48:02 +0200 (MET DST)
Subject: problem with remote databank access
Message-ID: <200108031248.OAA26689@bigben.vub.ac.be>

from : BEN

	Dear support,
	
While experimenting with remote databank access I noticed the following :

DB GENBANK [ type: N format: genbank method: url
             comment: 'GenBank at Institut Pasteur (Paris, France)'
             url: "http://srs.pasteur.fr/cgi-bin/srs6/wgetz?-e+[genbank-acc:%s]"
]

does work fine. However, with :

DB GENBANK [ type: N format: genbank method: url
             comment: 'GenBank at DKFZ (Heidelberg, Germany)'
             url: "http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+[genbank-acc:%s]"
]

seqret genbank:X15320 retrieves a file :

>ECARGS X15320 Escherichia coli argS gene for arginyl-tRNA-synthetase (EC 6.1.1.19

The problem is probably that at the DKFZ they index the databank in GCG format. 
However, replacing "format: genbank" by "format: gcg" does not work.

	Guy Bottu


From jackl at dalicon.com  Fri Aug  3 09:11:41 2001
From: jackl at dalicon.com (Jack Leunissen)
Date: Fri, 3 Aug 2001 15:11:41 +0200
Subject: problem with remote databank access
References: <200108031248.OAA26689@bigben.vub.ac.be>
Message-ID: <009001c11c1d$d74aaff0$0400a8c0@cmbipc32>

No, the problem is that their default output format is EMBL! That seems to
upset EMBOSS, as it expects GENBANK format for the sequence information too.

Changing the call to:
url:"http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+-sf+genbank+[genbank-acc:%s]"
does the trick! (Note the addition "+-sf+genbank" to force the sequence
output in GENBANK format.)
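
To see what request such a definition produces, the substitution can be reproduced by hand. This is a small sketch, assuming that the "method: url" access simply replaces the %s placeholder with the entry identifier from the USA; the helper name `build_query_url` is invented for the example.

```python
# Assumption: the %s in a "url:" definition is replaced by the entry
# identifier given in the USA (here the accession X15320).
DKFZ_TEMPLATE = (
    "http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/"
    "wgetz?-e+-sf+genbank+[genbank-acc:%s]"
)


def build_query_url(template, entry_id):
    """Substitute the entry identifier into a url: template."""
    return template % entry_id


print(build_query_url(DKFZ_TEMPLATE, "X15320"))
```

Pasting the printed URL into a browser is a quick way to check which format the remote server actually returns before blaming the local configuration.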

Cheers,
Jack

   Jack A.M. Leunissen         Email: jackl at cmbi.kun.nl
   Centre for Molecular and    Tel  :  +31 24 365 22 48
   Biomolecular Informatics    Fax  :  +31 24 365 29 77
   Nijmegen, Netherlands       http://www.cmbi.kun.nl/




From peter.rice at uk.lionbioscience.com  Fri Aug  3 11:07:39 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Fri, 03 Aug 2001 16:07:39 +0100
Subject: databanks in PIR format
References: <200108030943.LAA11981@bigben.vub.ac.be>
Message-ID: <3B6ABE3B.5EBE5C2C@uk.lionbioscience.com>

Guy Bottu wrote:
> >Quick fix would be to change the format in pir.i to be "plain" and run
> >srssection.
>
> I tried and it worked ! Thanks for the advice.
> 
> Still, there must be some nasty bug hidden in the SRS code, since similar
> problem does not occur with EMBL and GenBank formats. Let's hope they
> can fix it.

"It's not a bug, it's a feature"

As it has been there since SRS 5.0 (at least) and requires changes to the C
source code (so that PIR format behaves the same way as EMBL), it will have
to wait for a future release.

Meanwhile, the plain fix will work well enough - some software may want a
trailing '*' but probably most programs will be happy.

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From dalke at dalkescientific.com  Sun Aug  5 20:52:59 2001
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 6 Aug 2001 01:52:59 +0100
Subject: questions about ACD format
Message-ID: <005101c11e12$470a8180$0201a8c0@josiah.dalkescientific.com>

[Brief summary: I'm trying to integrate Emboss with Biopython and
found that 1) not enough sequence type information is available
in the ACD file for Biopython's AlphabetStrict code to work, so
I have a proposal to fix that, 2) I have questions about how to
interpret some of the documentation, 3) there are places where
the Emboss ACD parser doesn't appear to work correctly, and 4)
general observations on the ACD format and on the implementation.]

Hello,

First off, my apologies if this is the wrong email address for
this topic.  I couldn't find any archives to scan for verification.
I am also not a member of this list, so please cc me on any
replies.

Based on the feedback I got from some people at ISMB, I've started a
Python interface to EMBOSS.  The goal is to be able to do something
like:

>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Emboss import apps
>>>
>>> seq = Seq.Seq("AATCCATCGATGCAC", IUPAC.unambiguous_dna)
>>> results = apps.revseq(sequence = seq)
>>> results["outseq"]
Emboss.EmbossSeq("GTGCATCGATGGATT", IUPAC.ambiguous_dna)
>>>

I can almost, but not quite do this, for some reasons I'll describe
shortly.  Here are the questions and problems I had in doing this,
as well as some specific features I would like to see added, which
I feel may make it easier to integrate EMBOSS with other systems.

======

** Topic 1

As you can see in the above example, there is some automatic
conversion going on.  One is to convert the Biopython 'Seq' object
to a temporary file, so it can be used with the '-sequence'
parameter needed by revseq.  This is done by knowing how to convert
the Seq object to a 'seqall' Emboss type, including looking at the
'type' field to ensure that the input sequence is really DNA.

The conversion step requires that I do a verification of the Biopython
Seq Alphabet to the Emboss sequence 'type'.  There is a description
of the types in the syntax document, at
  http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html
but it doesn't describe:
  1) what is used as a gap character? (I assume '-')
  2) what is used for a stop character? (I assume '*')
  3) are selenocysteines encoded with a U?  (the pureprotein definition
      says it excludes "BZ or X", so I'm guessing selenocysteines aren't
      allowed - or are they encoded as X?)
  4) shouldn't there be a gapstopprotein?

** Topic 2

Another conversion is to create a temporary filename for the -outseq
parameter, based on the 'seqoutall' Emboss type.  I would like to read
the contents of this file into a Biopython Seq object, however, the
ACD description does not contain enough information for me to do that.
Instead, I can only create the tempfile and store the filename in
the "outseq" parameter.

Could a new 'type' parameter be added to 'seqoutall'?  This would
change revseq's "outseq" definition to be

seqoutall: outseq  [
  parameter: "Y"
  type: "dna"
  extension: "rev"
]

For applications like 'notseq' this would require using an operation:

seqoutall: outseq [
  param: Y
  type: "@($(sequence.protein)? protein : nucleic)"
]

The goal of this is to let researchers use EMBOSS from Python
without having to worry about an implementation detail - the
existence of a file system.

(BTW, I have heard there may be support in 3.0 for XML output
and the ability for all the output to be streamed to stdout.
I didn't find any details about this on my scan of the web
pages - what is the status and plans for this?)

** Topic 3:

The Emboss sequence data type contains the calculated attributes of
'protein', 'nucleic' and 'type'.  Is it that:
  - protein is true when the sequence type is
      'protein', 'gapprotein', 'pureprotein' or 'stopprotein'
  - nucleic is true when the sequence type is
      'dna', 'rna', 'puredna', 'purerna', 'nucleotide', 'purenucleotide',
      'gapnucleotide', 'gapdna', 'gaprna'
  - protein and nucleic are false for any other case
  ?
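
The interpretation asked about in this topic can be written down as a small lookup. This is only a sketch of the guess stated above, not confirmed EMBOSS behaviour; the names `PROTEIN_TYPES`, `NUCLEIC_TYPES` and `classify` are made up for the example.

```python
# Guessed grouping of ACD sequence types, taken from the question above.
PROTEIN_TYPES = {"protein", "gapprotein", "pureprotein", "stopprotein"}
NUCLEIC_TYPES = {
    "dna", "rna", "puredna", "purerna", "nucleotide", "purenucleotide",
    "gapnucleotide", "gapdna", "gaprna",
}


def classify(seq_type):
    """Return the assumed values of the 'protein' and 'nucleic' attributes."""
    name = seq_type.lower()
    return {"protein": name in PROTEIN_TYPES, "nucleic": name in NUCLEIC_TYPES}


print(classify("pureprotein"))
print(classify("any"))
```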

** Topic 4:

How do I force a sequence type?  The -sprotein and -snucleotide
command-line qualifiers are only boolean values, so there doesn't seem
to be any way to say an input is really a pureprotein.  Eg, there
could be a '-stype' qualifier, so I can do '-stype pureprotein'.

** Topic 5:

Given the existence of sequence.type, shouldn't most operations of the
form
    "@($(sequence.protein)? protein : nucleic)"
really be
    "$(sequence.type)"
?

This should allow better propagation of proper type information
through Emboss.

==

Okay, that's the sequence type related topic.  Now for some others,
first on parsing ACD files.  To get the parameter information I read
the ACD files.  There are actually two possible files to read: the
".acd" file and the file produced from the "-acdpretty" option.


** Topic 6:

Which is the preferred mechanism for getting ACD configuration
information?  There are advantages and disadvantages to either one.

  - The .acd file does not require executing a possibly arbitrary
  program to get its parameter information.  This can be a subtle
  security problem because the mechanism I'm using just does a
  system() call to see if the program exists, and has no qualms in
  running "rm-rf / && echo", which expands to the valid command
  "rm -rf / && echo -acdpretty".  By checking the acd file first,
  it eliminates that possibility, although it does require that
  the directory containing the .acd definitions be well-known.
  Is this well-known directory $EMBOSS_ACDROOT or is that a 1.x
  location?

  (The other possibility is to require that all Emboss executables and
  only Emboss executables be in a well-defined directory.  Looking at
  the standard 'configure', this is not usually the case - they get put
  into /usr/local/bin )

  - a problem with using the .acd file is that it may be out of sync
  with the actual executable

  - the -acdpretty option is problematic in that it writes its
  information to a file in the local directory.  My Python code cannot
  guarantee that the local directory is writeable, so I need to mkdir
  a temp directory then "cd $(tmpdir) && $(program) -acdpretty" then
  read "$(tmpdir)/$(program).acdpretty" then remove the directory.  It
  would be so much easier if -acdpretty option could write to stdout.
  (Eg, as when used as '-acdpretty -stdout')

  - the .acd file may use abbreviated names.  For example, it may have
  a qualifier as "param" instead of "parameter".  So the -acdpretty
  text is easier to parse.

I would prefer getting the ACD data directly from the executable.  Is
it possible to allow -stdout as an option to -acdpretty to make it
dump to stdout?  The other issues I can work around.
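
The temporary-directory workaround described above could look roughly like this. It is a sketch under the assumptions stated in this message: that `program -acdpretty` writes `<program>.acdpretty` into the current directory. The function name `read_acdpretty` and its `run` parameter are invented for the example; `run` exists only so the function can be exercised without EMBOSS installed.

```python
import os
import subprocess
import tempfile


def read_acdpretty(program, run=subprocess.run):
    """Run `program -acdpretty` in a scratch directory and return the text
    of the <program>.acdpretty file it is assumed to leave there.

    `program` should be a bare name found on PATH or an absolute path,
    because the working directory changes for the child process.
    """
    name = os.path.basename(program)
    with tempfile.TemporaryDirectory() as tmpdir:
        # An argument list (no shell) means a hostile program name cannot
        # expand into additional shell commands.
        run([program, "-acdpretty"], cwd=tmpdir, check=True)
        with open(os.path.join(tmpdir, name + ".acdpretty")) as handle:
            return handle.read()
```

Passing the arguments as a list rather than through system() also sidesteps the "rm -rf / && echo" problem mentioned earlier, since nothing is interpreted by a shell.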

** Topic 7:

The ACD syntax definition is incomplete.  Here are some problems I ran
across.

> Comments start with "#" and continue to the end of the line.

Must the '#' be in the first character position?

The function ajacd.c:acdNoComment looks like it truncates the line at
the first '#', no matter where it is in the string, so the '#' doesn't
need to be the first character.

On the other hand, it looks like that bit of code doesn't understand
quoted strings.  Consider

% cat foo.acd
appl: foo [
        doc: "Who is #1?"
        groups: "Edit"
]

% ../acdc foo
Who is groups: "Edit
%
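
For comparison, a comment stripper that respects quoted strings needs only a little state. This is a minimal sketch, not the logic in ajacd.c: it tracks double quotes only, and the name `strip_comment` is made up for the example.

```python
def strip_comment(line):
    """Remove a trailing '#' comment unless the '#' sits inside double quotes."""
    in_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            in_quotes = not in_quotes
        elif char == "#" and not in_quotes:
            return line[:index]
    return line


print(strip_comment('        doc: "Who is #1?"'))
print(strip_comment('        groups: "Edit"  # menu group'))
```

With this rule the doc line from foo.acd survives intact, while a '#' outside quotes still starts a comment anywhere on the line.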

> Each line is parsed into tokens delimited by spaces

What is the definition of a token?  We also have that

> Parameters and qualifiers are defined by a single token followed by
> either a colon ':' (preferred) [1] or an equal sign '=' which in
> turn is followed by a second token.

This means a token cannot end in a ':' or a '='.  But it can contain a
':' outside of quotes, as in

   opt: @($(showall)?N:Y)

Or consider

% cat foo.acd
appl: foo [
        doc: A:
]

% ../acdc foo
A:
%

This means the ':' is not part of the first token in a
parameter/qualifier but is part of the second token.

Spaces aren't really the token delimiter.  The file 'wordcount.acd'
contains

  sequence: sequence [ param: Y type: dna]

so the token 'dna' is not space delimited before the ']'.  Also,
checktrans.acd uses 'min:1' which is not space delimited.


I'm trying to figure out how ajacd.c does it, but I'm getting lost in
the code.

To make things even more confusing

% cat foo.acd
appl: f"oo [
        doc:A]B
]


% ../acdc foo
A]B
%

Also, the term 'space' in the documentation should be 'whitespace'
since it can skip '\t' characters.  Hmm, and looking at the code,
there are problems with how it skips the ':' characters.

% cat foo.acd
appl:::: foo [
        doc: "This is the doc."
]

% ../acdc foo
This is the doc.
%

And using a NUL character
% od -c foo.acd
0000000   a   p   p   l   :       f   o   o       [  \n
0000020                   d   o   c  \0   :       "   H   a   s       a
0000040       N   U   L       c   h   a   r   a   c   t   e   r   "  \n
0000060                                   S   t   r   a   n   g   e  \n
0000100   ]  \n
0000102
% ../acdc foo
Strange
%

So the parser code does not fully validate that the input data is in
the correct format.

> After the name, definitions are in mandatory square brackets, [],
> which can make a definition span multiple lines.

seqretallfeat.acd contains the following two lines

  endsection: secoutseq
  endsection: secinseq

which don't have the [].  My parser ends up special casing the
'endsection' declarations.  Would it be possible to use, say,

  endsection: secoutseq []

instead?  (Also, section and endsection are not defined anywhere in
that syntax document.)

> Tokens representing data types can be abbreviated up to the point
> where they are not ambiguous

That's a VMS-help-style shortcut.  As I recall, that has a
forward-compatibility problem.  For example, if a new data type called
'apple' is added, then 'a', 'ap', 'app', and 'appl' are no longer
unambiguous.

Has there been any consideration on how to deal with that?
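
The forward-compatibility problem is easy to demonstrate with a prefix matcher. This is an illustrative sketch only: the list of names is a made-up subset, and `resolve_abbreviation` is not a function from EMBOSS.

```python
def resolve_abbreviation(token, known_names):
    """Expand an unambiguous prefix to the full name, or raise KeyError."""
    matches = [name for name in known_names if name.startswith(token)]
    if len(matches) == 1:
        return matches[0]
    if token in matches:
        return token  # an exact match wins over longer names
    if not matches:
        raise KeyError(f"unknown name: {token!r}")
    raise KeyError(f"ambiguous abbreviation {token!r}: {sorted(matches)}")


names = ["application", "integer", "sequence"]  # illustrative subset
print(resolve_abbreviation("appl", names))

try:
    # Adding a hypothetical 'apple' type breaks existing files using 'appl'.
    resolve_abbreviation("appl", names + ["apple"])
except KeyError as error:
    print("now fails:", error)
```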

> Values can be delimited (i.e. treated as one token) by any of the
> following pairs, which are stripped as the value is parsed :
>
>      '' {} () [] <>

It's not clear what a "value" means?  In this section there is

token: token [
  definition
]

But later on the word 'attributes' is used instead of
'definition':

data_type: parameter_name [
    attributes
]

and only then does it say what a value is:

> A defining attribute must have a second token representing the value
> of the attribute.

So perhaps there should be some cleanup of the definition.  (The
reason I needed to figure this out was to check that

appl: foo [
  "multiword attribute": N,
]

was indeed supposed to be illegal.)

There doesn't appear to be any way to escape a quote character inside
of a quoted token.  At least, not that I could see in the code.  So there's
no way to write something like

appl: foo [
        doc: "Remove the characters ""{}<>()'"
]

for the string
  Remove the characters "{}<>()'

Also, the doc says the valid characters are
    '' {} () [] <>
but that should include "double quotes"

And just why are there so many quote characters?

** Topic 8:

It took me a while to figure out that ajacd.c did the ACD parsing.
The file ajnam.c parses the .embossrc and emboss.default which is
described in
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Usa/databases.html and is
*almost* in ACD format.  The difference is that it doesn't have an
'application' term and the 'DB' needs to be 'DB:'.

I can tweak my parser to handle the 'DB' term, but why can't those two
files really be in ACD format?  ... Although implementation-wise the
ajacd.c file uses static variables so it can only be used to parse one
file.

I noticed a couple of problems with how ajnam.c works.

   - It only understands a comment as a '#' in the first character
   position (while ajacd.c recognizes it anywhere).

   - The code uses "fgets(line, 512, file)" which looks like it can
   fail if the line is more than 511 characters, as with long file
   names.

(Actually, since this is a completely different implementation of the
parser, the failure conditions are different.  For example, there is a
namNoColon in ajnam.c, but nothing to strip a '='.)

** Topic 9

There needs to be some clarification on the license.  When I looked at
the code I read the top-level "COPYING" file, which is the GPL.  I
have a policy not to look at GPL'ed code too closely, since I worry
that it may contaminate my ability to write equivalent non-GPL code,
like the BSDed Biopython code.  LGPL is not quite as bad, but even
then I write the non-FSF-licensed code first then if needed for
verification I look at the LGPL'ed code.

Since the top-level COPYING file is the GPL, that put me off looking
at any of the source code, even for verifying format requirements.  It
had to be pointed out to me that the ajax and nucleus codes are
covered under the LGPL.  I would not have discovered that on my own,
because the multiple license use wasn't mentioned in the README.

In addition, I noticed the ./LICENSE is slightly different than the
current Version 2, June 1991 one from the FSF.  The FSF address is
wrong, and there are formatting changes.  I cannot tell if there are
any text changes.

I also noticed ./COPYING file is the GPL, except for a change in the
address and the exclusion of the section "How to Apply These Terms to
Your New Programs"

Shouldn't these be identical files, and match the current FSF GPL?


** Topic 10

What does the 'warnrange' attribute of an integer do?  (I've only
lightly scanned the table of data types so will likely have more
questions about the other fields in the future.)


** Topic 11

In scanning the code I noticed there is an indirection layer, which I
assume is to isolate the programmer from changes in the OS and C
library.  It isn't used everywhere.  For example, there's an
ajNamGetenv but several places call getenv directly.

I also did a scan looking for possible overflows and other security
problems.  Because of my inexperience with the indirection layer I
couldn't do an in-depth check, but I did notice that ajStrFromFloat
and ajStrFromDouble can fail on Inf, -Inf and NaN, for a couple of
reasons:

% cat inf.c
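#include <stdio.h>
#include <stdlib.h>   /* strtod, abs */
#include <math.h>     /* log10 */

int main(void) {
  /* float val = -1.0/0.0; */
  float val = strtod("-inf", NULL);
  char s[100];
  int precision = 0, ival, i;

  sprintf(s, "val == >>%.0f<<", val);
  puts(s);

  /* Converting an infinite float to int is not well defined in C;
     the value printed below is what this platform happened to give. */
  ival = abs((int) val);
  printf("ival = %d\n", ival);

  /* Length estimate computed from the bogus ival */
  if (ival)
    i = precision + (int) log10((double)ival) + 4;
  else
    i = precision + 4;

  printf("i == %d\n", i);
  return 0;
}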
% cc inf.c -lm
% ./a.out
val == >>-inf<<
ival = -2147483648
i == -2147483644
%
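
For comparison, here is the same size arithmetic as a small Python
sketch with an explicit guard for non-finite values.  The function name
float_buffer_len is made up for this note and is not part of AJAX; the
fixed size returned for Inf and NaN is an assumption about what a
corrected C version might reserve.

```python
import math

def float_buffer_len(val, precision=0):
    """Characters to reserve for val, mirroring the arithmetic in inf.c.

    Hypothetical helper: Inf, -Inf and NaN get a small fixed size
    instead of going through the integer cast that overflows in C.
    """
    if not math.isfinite(val):
        # "-inf" is the longest spelling; +1 for the C terminating NUL.
        return len("-inf") + 1
    ival = abs(int(val))
    if ival:
        return precision + int(math.log10(ival)) + 4
    return precision + 4
```

The point of the guard is only that the non-finite case is decided
before any conversion to an integer takes place.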

** Topic 12

Here's my first pass of the BNF for the ACD file.  There are various
things to fix, some of which are noted.  This can be used for every
file in the emboss/acd directory except qatest.acd (which contains a
syntax error that acdc doesn't catch -- the "int bint" field) and
testplot.acd (contains an '=' instead of a ':', which I don't yet
handle).

Lexer:
  colon = ":"
  open_block = "\["
  close_block = "\]"
  endsection = "endsection"
  key = "(?!endsection)[a-zA-Z0-9_]+(?=[\s:\][])"
  value = "[a-zA-Z0-9_]+(?![\s:\][])[^\s\]]* |
           [^\000-\037a-zA-Z0-9_:[\]\s][^\s\]]*"
  quoted = '"[^"]*"'   (only handles double quotes - need to fix)
  comment = "[#][^\n\r]*(\r|\r?\n)" SKIPPED
  whitespace = "\s+" SKIPPED

Parser:  (need to update the names to match the syntax doc)
  application ::= widget_list

  widget_list ::= widget |
                  widget widget_list

  widget ::= key colon key open_block arglist close_block
           | key colon key key
           | key colon key value
           | endsection colon key

  arglist ::= arg
           |  arg arglist

  arg ::= key colon key
        | key colon value

** Topic 13

One last thing.  The parameter information for the different ACD
data types is hard coded in ajacd.c.  If it was stored in an external
data file (in ACD format with well-defined fields :) then my Python
code could read that meta-information to build up its tables, rather
than me having to code it all by hand.


Hope this wasn't too much at once :)

    Andrew
    dalke at dalkescientific.com


From 962856211 at tay.ac.uk  Tue Aug  7 11:17:47 2001
From: 962856211 at tay.ac.uk (962856211 at tay.ac.uk)
Date: Tue, 7 Aug 2001 16:17:47 +0100
Subject: free downloads?
Message-ID: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>

The list of programs you have at
http://www.uk.embnet.org/Software/EMBOSS/Apps/

is it a list of free downloads?
Barry Marshall BSC Hons

From gwilliam at hgmp.mrc.ac.uk  Tue Aug  7 11:27:39 2001
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Tue, 07 Aug 2001 16:27:39 +0100
Subject: free downloads?
References: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>
Message-ID: <3B7008EB.F4437A4F@hgmp.mrc.ac.uk>

> 962856211 at tay.ac.uk wrote:
> 
> The list of programs you have at
> http://www.uk.embnet.org/Software/EMBOSS/Apps/
> 
> is it a list of freedownloads?
> Barry Marshall BSC Hons


This is a list of the applications in the EMBOSS package.
The package can be downloaded for free (under the GPL licence).

See:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From dmartin at bioinformatics.msiwtb.dundee.ac.uk  Tue Aug  7 12:22:31 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Tue, 7 Aug 2001 17:22:31 +0100 (BST)
Subject: free downloads?
In-Reply-To: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>
Message-ID: <Pine.LNX.4.33.0108071713200.23186-100000@bioinformatics.msiwtb.dundee.ac.uk>

On Tue, 7 Aug 2001 962856211 at tay.ac.uk wrote:

> The list of programs you have at
> http://www.uk.embnet.org/Software/EMBOSS/Apps/
>
> is it a list of freedownloads?

EMBOSS is a freely downloadable package licensed under the GPL/LGPL.

You will probably want a unix/linux system on which to install it. The
admin guide describes in excruciating detail how to do this (look in the
documentation section of the web site).

If you are in Dundee (at least your email address is) then drop by if you
have any questions.

..d


----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------


From cbonnard at isrec-sg1.unil.ch  Mon Aug 20 05:09:11 2001
From: cbonnard at isrec-sg1.unil.ch (Claude Bonnard)
Date: Mon, 20 Aug 2001 11:09:11 +0200
Subject: Database access for EMBOSS
Message-ID: <10108201109.ZM13075@isrec-sg1>

Hello,

It is not very surprising that SRS is the best mode for fast access to the
sequence databases from EMBOSS. As I understand it, the URL mode allows access
to a SINGLE sequence and does not support the "USA" standard (wildcard query)
as SRS mode does.

If that is the case, is there a solution when the SRS server is NOT on the same
machine, but on a machine which is dedicated to SRS? I have in mind an rsh type
of request, and I would like to know whether anyone has experienced this type of
problem and could help me solve it.

Thanks a lot

Regards
Claude

-- 
Claude Bonnard Ph.D.
ISREC (Swiss Institute for Experimental Cancer Research)
Bioinformatics Group
Ch des Boveresses 155
CH-1066 Epalinges
Switzerland
phone: [41-21]-692-5891/-2236
  fax: [41-21]-652-6933


From peter.rice at uk.lionbioscience.com  Mon Aug 20 05:20:55 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 20 Aug 2001 10:20:55 +0100
Subject: Database access for EMBOSS
References: <10108201109.ZM13075@isrec-sg1>
Message-ID: <3B80D677.99A6DCC8@uk.lionbioscience.com>

Claude Bonnard wrote:
> It is not very surprising that SRS is the  best mode  for a fast access
> to the sequence databases from EMBOSS. As I understood, the URL mode
> allows the access to a SINGLE sequence and would not support the
> "USA" standard (wild card query) as SRS mode does.

True. We could add an "SRSREMOTE" access mode to extend queries, easy to
program but maybe limited practical use.
 
> If it is the case, is there a solution when the SRS server is NOT
> on the same machine, but on a machine which is dedicated to SRS?
> I have in mind a rsh type of request and I would like to know if
> someone experience this type of problem and could help me in
> solving that.

SRS access mode allows you to define the name of the getz program.

How about an alternative name that is a script, and uses rsh to run a
remote getz and returns the results?

For example, if your script is called 'remotegetz' just add this to the
database definition:

app: remotegetz

(you can use the full path if needed)

Note: This was originally added because the Sanger Centre ran 2 versions of
SRS (5.1 and 6.0) and I needed to switch between them, but it has other
possible uses.
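
A minimal sketch of such a wrapper in Python, assuming rsh is installed
locally and that getz is on the remote account's PATH.  The host name
srs.example.org and the function names are placeholders, and the
arguments are passed through unchanged rather than interpreted.

```python
import shlex
import subprocess
import sys

REMOTE_HOST = "srs.example.org"  # hypothetical machine running SRS
REMOTE_GETZ = "getz"             # assumed to be on the remote PATH

def build_command(args, host=REMOTE_HOST, getz=REMOTE_GETZ):
    """Build the local rsh command that runs getz remotely.

    rsh hands the command to a remote shell, so each argument is
    quoted to protect the brackets and wildcards of an SRS query.
    """
    remote = " ".join(shlex.quote(a) for a in [getz] + list(args))
    return ["rsh", host, remote]

def main(argv):
    """Run the remote getz; its output goes straight to our stdout."""
    result = subprocess.run(build_command(argv[1:]))
    return result.returncode
```

Saved as an executable script named remotegetz that ends by calling
sys.exit(main(sys.argv)), this could be named in the database
definition as shown above.  Bear in mind that rsh may not pass back the
exit status of the remote command.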


-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From dmartin at bioinformatics.msiwtb.dundee.ac.uk  Mon Aug 20 05:34:19 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Mon, 20 Aug 2001 10:34:19 +0100 (BST)
Subject: Database access for EMBOSS
In-Reply-To: <3B80D677.99A6DCC8@uk.lionbioscience.com>
Message-ID: <Pine.LNX.4.33.0108201030440.6257-100000@bioinformatics.msiwtb.dundee.ac.uk>

On Mon, 20 Aug 2001, Peter Rice wrote:

> Claude Bonnard wrote:
> > It is not very surprising that SRS is the  best mode  for a fast access
> > to the sequence databases from EMBOSS. As I understood, the URL mode
> > allows the access to a SINGLE sequence and would not support the
> > "USA" standard (wild card query) as SRS mode does.
>
> True. We could add an "SRSREMOTE" access mode to extend queries, easy to
> program but maybe limited practical use.
>
> > If it is the case, is there a solution when the SRS server is NOT
> > on the same machine, but on a machine which is dedicated to SRS?
> > I have in mind a rsh type of request and I would like to know if
> > someone experience this type of problem and could help me in
> > solving that.
>
> SRS access mode allows you to define the name of the getz program.
>
> How about an alternative name that is a script, and uses rsh to run a
> remote getz and returns the results?
>
> For example, if your script is called 'remotegetz' just add this to the
> database definition:
>
> app: remotegetz
>
> (you can use the full path if needed)
>
> Note: This was originally added because the Sanger Centre ran 2 versions of
> SRS (5.1 and 6.0) and I needed to switch between them, but it has other
> possible uses.

This would then allow one to add whichever script one wanted, as long as it
could parse SRS-style arguments.

It doesn't have to be SRS, just look like it. The potential is there for
wrapping an in-house RDBMS with such a script.

I'll add some more comments to the admin guide if Peter can send me
details of how EMBOSS calls the wgetz program (not being much of an srs
hacker myself).


..d


----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------


From peter.rice at uk.lionbioscience.com  Mon Aug 20 05:50:22 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 20 Aug 2001 10:50:22 +0100
Subject: Database access for EMBOSS
References: <Pine.LNX.4.33.0108201030440.6257-100000@bioinformatics.msiwtb.dundee.ac.uk>
Message-ID: <3B80DD5E.F61846E1@uk.lionbioscience.com>

David Martin wrote:
> 
> On Mon, 20 Aug 2001, Peter Rice wrote:
> > For example, if your script is called 'remotegetz' just add this to the
> > database definition:
> >
> > app: remotegetz
> 
> This would then allow one to add whichever script one wanted as long
> as it could parse srs style arguements..
> 
> It doesn't have to be SRS, just look like it.. The potential is there
> for wrapping in house rdbms with such a script.
> 
> I'll add some more comments to the admin guide if Peter can send me
> details of how EMBOSS calls the wgetz program (not being much of an srs
> hacker myself).

This is getz, not wgetz. It supports the full SRS query language because it
calls getz (or a user defined script) with an SRS query constructed from
the USA.

But there is also an access method in general for external applications.
You can use this to set up RDBMS calls - which anyway was the original
intention.

At present it picks up dbname:id or dbname:acc as the rest of the command
line, or puts the id/accession into a formatted string (if the application
definition includes %s), but can easily be adapted further.
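
To make the two cases concrete, here is a Python sketch of the
behaviour described in the previous paragraph.  It only illustrates
that description and is not the EMBOSS code; the function name and the
example application names in the comments are hypothetical.

```python
def build_app_command(app, dbname, entry):
    """Return the command line for an external-application lookup.

    app    -- the application definition from the database entry
    dbname -- the database name from the USA
    entry  -- the id or accession from the USA
    """
    if "%s" in app:
        # Definition contains a format slot: substitute the id/accession,
        # e.g. "sqlfetch --id=%s" (hypothetical program).
        return app.replace("%s", entry)
    # Otherwise append dbname:id as the rest of the command line.
    return "%s %s:%s" % (app, dbname, entry)
```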

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From cutler at tularik.com  Mon Aug 27 14:34:15 2001
From: cutler at tularik.com (Gene Cutler)
Date: Mon, 27 Aug 2001 11:34:15 -0700
Subject: drawing trees
Message-ID: <a05101006b7b041c61d1e@[192.168.50.41]>

Hello, all.  I have a question about phylogenetic-type trees for 
sequences.  I haven't quite figured out how to do this using 
emboss/phylip.  This is how I have been doing this with gcg:

perform multiple sequence alignment (generally hmmalign or clustalw)
convert file to msf format if not already (e.g., sreformat from hmmer package)
run gcg program distances on the msf file
run gcg program growtree on the distances file
end up with a postscript file

How would I do this with PHYLIP instead?

Thanks.


From stein at fieldmuseum.org  Mon Aug 27 15:20:00 2001
From: stein at fieldmuseum.org (Jennifer Steinbachs)
Date: Mon, 27 Aug 2001 14:20:00 -0500 (CDT)
Subject: drawing trees
In-Reply-To: <a05101006b7b041c61d1e@[192.168.50.41]>
Message-ID: <Pine.LNX.4.33.0108271412360.30944-100000@mail.fmnh.org>


Use your favourite alignment program...
Put your aligned sequences into PHYLIP format
Run the appropriate phylip program...
    distance-based methods:
	protdist (for proteins)
	dnadist (for dna)
    parsimony
	protpars
	dnapars
    likelihood
	dnaml or dnamlk
	protml

See the phylip website
for more info (http://evolution.genetics.washington.edu/phylip.html).

If you aren't certain of the differences between the different
tree-building algorithms, you should get your hands on Nei and Kumar 2000
Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
reference for phylogenetics is Hillis et al 1996 Molecular Systematics
ISBN 0-87893-282-8 Chapter 11.

-jennifer


On Mon, 27 Aug 2001, Gene Cutler wrote:

 >Hello, all.  I have a question about phylogenetic-type trees for
 >sequences.  I haven't quite figured out how to do this using
 >emboss/phylip.  This is how I have been doing this with gcg:
 >
 >perform multiple sequence alignment (generally hmmalign or clustalw)
 >convert file to msf format if not already (e.g., sreformat from hmmer package)
 >run gcg program distances on the msf file
 >run gcg program growtree on the distances file
 >end up with a postscript file
 >
 >How would I do this with PHYLIP instead?
 >
 >Thanks.
 >
 >


-----------------------------------
J. Steinbachs, Ph.D.
Computational Biologist
http://compbiology.org
-----------------------------------


From cutler at tularik.com  Mon Aug 27 16:19:56 2001
From: cutler at tularik.com (Gene Cutler)
Date: Mon, 27 Aug 2001 13:19:56 -0700
Subject: drawing trees
In-Reply-To: <Pine.LNX.4.33.0108271412360.30944-100000@mail.fmnh.org>
References: <Pine.LNX.4.33.0108271412360.30944-100000@mail.fmnh.org>
Message-ID: <a05101017b7b05a35d728@[192.168.50.41]>

Thanks Jennifer.  One more question:

>Put your aligned sequences into PHYLIP format

I didn't see any information on the phylip webpage about phylip format
and/or conversion tools.  I can do the conversion myself if I can find
documentation on the format.

>If you aren't certain of the differences between the different
>tree-building algorithms, you should get your hands on Nei and Kumar 2000
>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
>ISBN 0-87893-282-8 Chapter 11.

That's useful too.  Thanks again.


-- 

-=-=-=-=-=-=-=-=-=-=-=-=-=-
Gene Cutler
Bioinformatics Scientist
cutler at tularik.com
- - - - - - - - - - - - -
Tularik Inc
2 Corporate Drive
South San Francisco, CA 94080, USA
http://www.tularik.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-


From stein at fieldmuseum.org  Mon Aug 27 17:09:55 2001
From: stein at fieldmuseum.org (Jennifer Steinbachs)
Date: Mon, 27 Aug 2001 16:09:55 -0500 (CDT)
Subject: drawing trees
In-Reply-To: <a05101017b7b05a35d728@[192.168.50.41]>
Message-ID: <Pine.LNX.4.33.0108271602370.31408-100000@mail.fmnh.org>


I like to use Seaview to make the conversion (I don't have the website
handy but a google search should produce it quickly).  ClustalX (and maybe
clustalw) also produce Phylip files.  I thought the Phylip website had
information on the format, but it's been a while since I've actually
perused the documentation.  The phylip docs should definitely have
complete information.

If I recall correctly, it is something like:

#sequences #nucleotide_sites
sequence_name sequence
sequence_name sequence

etc.

There used to be a 10 character limit on sequence_name, but I don't know
if that holds with the latest version - I use PAUP* mostly for my
analyses.  Sequence can be non-interleaved or interleaved.
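
The layout above can be sketched as a small Python writer for the
non-interleaved form.  It assumes the classic rule that names occupy
exactly 10 characters (truncated or padded with blanks); the function
name to_phylip is made up for this example, and truncation can make two
long names collide, which is not checked here.

```python
def to_phylip(alignment):
    """Format aligned sequences as non-interleaved PHYLIP text.

    alignment -- list of (name, sequence) pairs of equal sequence length
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0][1])
    # Header line: number of sequences, then number of sites.
    lines = ["%d %d" % (len(alignment), length)]
    for name, seq in alignment:
        if len(seq) != length:
            raise ValueError("sequence %r has a different length" % name)
        # Name field is fixed at 10 characters, followed by the sequence.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"
```

For example, to_phylip([("seqA", "ACGT"), ("seqB", "AC-T")]) gives a
header line "2 4" followed by one line per sequence.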

-jennifer

On Mon, 27 Aug 2001, Gene Cutler wrote:

 >Thanks Jennifer.  One more question:
 >
 >>Put your aligned sequences into PHYLIP format
 >
 >I didn't see any information on the phylip webpage about phylip format
 >and/or conversion tools.  I can do the conversion myself if I can find
 >documentation on the format.
 >
 >>If you aren't certain of the differences between the different
 >>tree-building algorithms, you should get your hands on Nei and Kumar 2000
 >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
 >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
 >>ISBN 0-87893-282-8 Chapter 11.
 >
 >That's useful too.  Thanks again.
 >
 >
 >

-- 
-----------------------------------
J. Steinbachs, Ph.D.
Computational Biologist
http://compbiology.org
-----------------------------------


From jrvalverde at cnb.uam.es  Tue Aug 28 02:05:15 2001
From: jrvalverde at cnb.uam.es (jrvalverde at cnb.uam.es)
Date: Tue, 28 Aug 2001 08:05:15 +0200 (DST)
Subject: drawing trees
In-Reply-To: <a05101017b7b05a35d728@[192.168.50.41]>
Message-ID: <200108280605.f7S65GE1348757@embnet.cnb.uam.es>

Gene Cutler <cutler at tularik.com> wrote:
> Thanks Jennifer.  One more question:
> 
> >Put your aligned sequences into PHYLIP format
> 
> I didn't see any information on the phylip webpage about phylip format
> and/or conversion tools.  I can do the conversion myself if I can find
> documentation on the format.

It's in the package documentation, but if you are already using CLUSTAL,
then simply go to "multiple alignments", choose "output options" and
then select "PHYLIP" format.

As for the details, look either at the documentation ("main.doc") or
at EMBnet's Quick Guide to PHYLIP.  PDF and HTML versions may still be
found at
	http://www.es.embnet.org/~pprpc/activs/PHYLIPGuide/PhylipGuide-1.6.html

				j


From frank at bioss.ac.uk  Tue Aug 28 03:45:33 2001
From: frank at bioss.ac.uk (Frank Wright)
Date: Tue, 28 Aug 2001 08:45:33 +0100
Subject: drawing trees
References: <a05101006b7b041c61d1e@[192.168.50.41]>
Message-ID: <3B8B4C1D.40315E3C@bioss.ac.uk>

Hi Gene,

>I didn't see any information on the phylip webpage about phylip format
>and/or conversion tools.  I can do the conversion myself if I can find
>documentation on the format.

PHYLIP FORMAT
-------------

PHYLIP format is discussed in the PHYLIP documentation "main" file:

   http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html

See section 6, "Overview of the input and output formats", subsection
1, "Input File Format".
See the PHYLIP "sequences" documentation for details of how PHYLIP codes
unknowns and gaps:

   http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html 

READSEQ
-------
The READSEQ program is very useful for converting between alignment
formats (you may have to edit "." to be "-" though as PHYLIP codes gaps
differently). 

ftp://ftp.bio.indiana.edu/molbio/readseq/

To use:

  Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a

There is an old version and a Java version.  I prefer the old one :-).


Other programs
--------------

Jennifer Steinbachs has suggested some programs for reformatting
alignments.  Here are some additional comments:

There are other programs that can reformat alignment files.  CLUSTALW
can be used to read in alignments (option 1) and write them out (option
2, suboption 9) without doing an alignment.  I've not used SEQIO but it
looks useful:

http://bioweb.pasteur.fr/docs/seqio/seqio.html

On a PC you could use "export" and "import" facilities in GENEDOC, an
excellent alignment editor.

http://www.psc.edu/biomed/genedoc/


Best Wishes,
Frank
--
Frank Wright
Biomathematics and Statistics Scotland, 
SCRI, DUNDEE DD2 5DA, Scotland
frank at bioss.sari.ac.uk


From letondal at pasteur.fr  Tue Aug 28 03:51:50 2001
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 28 Aug 2001 09:51:50 +0200
Subject: drawing trees 
In-Reply-To: Your message of "Tue, 28 Aug 2001 08:45:33 BST."
             <3B8B4C1D.40315E3C@bioss.ac.uk> 
Message-ID: <200108280751.f7S7poM220418@electre.pasteur.fr>


Frank Wright wrote:
> Hi Gene,
> 
> >I didn't see any information on the phylip webpage about phylip format
> >and/or conversion tools.  I can do the conversion myself if I can find
> >documentation on the format.

Our Phylip Web server (http://bioweb/seqanal/phylogeny/phylip-uk.html) may help
with format conversion as well as with chaining phylogenetic programs.

> 
> PHYLIP FORMAT
> -------------
> 
> PHYLIP format is discussed in the PHYLIP documentation "main" file:
> 
>    http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html
> 
> See section 6.Overview of the input and output formats/ subsection
> 1.Input File Format 
> See the PHYLIP "sequences" documentation for details of how PHYLIP codes
> unknowns and gaps:
> 
>    http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html 
> 
> READSEQ
> -------
> The READSEQ program is very useful for converting between alignment
> formats (you may have to edit "." to be "-" though as PHYLIP codes gaps
> differently). 
> 
> ftp://ftp.bio.indiana.edu/molbio/readseq/
> 
> To use:
> 
>   Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a
> 
> There is an old version and a Java version.  I prefer the old one :-).
> 
> 
> Other programs
> --------------
> 
> Jennifer Steinbachs has suggested some programs for reformatting
> alignments.  Here's some additional comments:
> 
> There are other programs that can reformat alignment files.  CLUSTALW
> can be used to read in alignments (option 1) and write them out (option
> 2, suboption 9) without doing an alignment.  I've not used SEQIO but it
> looks useful:
> 
> http://bioweb.pasteur.fr/docs/seqio/seqio.html
> 
> On a PC you could use "export" and "import" facilities in GENEDOC, an
> excellent alignment editor.
> 
> http://www.psc.edu/biomed/genedoc/
> 
> 
> 
> Best Wishes,
> Frank
> --
> Frank Wright
> Biomathematics and Statistics Scotland, 
> SCRI, DUNDEE DD2 5DA, Scotland
> frank at bioss.sari.ac.uk

--
Catherine Letondal -- Pasteur Institute Computing Center


From frank at bioss.ac.uk  Tue Aug 28 04:04:13 2001
From: frank at bioss.ac.uk (Frank Wright)
Date: Tue, 28 Aug 2001 09:04:13 +0100
Subject: drawing trees
References: <a05101006b7b041c61d1e@[192.168.50.41]>
Message-ID: <3B8B507D.B7639FE5@bioss.ac.uk>

Hi All,

Gene Cutler asked:

>Hello, all.  I have a question about phylogenetic-type trees for 
>sequences.  I haven't quite figured out how to do this using 
>emboss/phylip.  This is how I have been doing this with gcg:
>
>run gcg program distances on the msf file
>run gcg program growtree on the distances file
>
>How would I do this with PHYLIP instead?

The GCG DISTANCES program and GCG GROWTREE programs are very similar to
the DNADIST/PROTDIST and Neighbor programs in PHYLIP.  In other words,
they allow phylogenetic trees to be constructed using "distance-based"
methods, but do not allow maximum likelihood or parsimony methods to be
used.  They also don't do bootstrapping tests, tree comparisons, and
lots of other things.

If you are using distance-based phylogenetic methods, some notes:

(1) Weighted least-squares is slower but more accurate than
Neighbor-Joining, so use the PHYLIP FITCH program instead of NEIGHBOR.

(2) Recently, the PHYLIP-like WEIGHBOR program (Weighted
Neighbor-Joining) has been released.  WEIGHBOR appears to be an
improvement on Neighbor-Joining (and possibly weighted least squares). 
See http://www.t10.lanl.gov/billb/weighbor/.  I've not tried it out much
but the simulations in the published paper look convincing.

(3) PHYLIP (version 3.6) has improved DNADIST (more distance methods)
and PROTDIST (rate heterogeneity among sites added).

Best Wishes,
Frank
-- 
Frank Wright
Biomathematics and Statistics Scotland, 
SCRI, DUNDEE DD2 5DA, Scotland
frank at bioss.sari.ac.uk


From dmartin at bioinformatics.msiwtb.dundee.ac.uk  Tue Aug 28 04:26:00 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Tue, 28 Aug 2001 09:26:00 +0100 (BST)
Subject: drawing trees
In-Reply-To: <Pine.LNX.4.33.0108271602370.31408-100000@mail.fmnh.org>
Message-ID: <Pine.LNX.4.33.0108280923580.7400-100000@bioinformatics.msiwtb.dundee.ac.uk>


Remember that with the Phylip programs distributed as an EMBASSY package,
EMBOSS will do the sequence conversions for you.

The programs to run have the same names with an "e" prepended. Their
options are also defined in ACD format, so the programs can be fully
scripted. The two programs that haven't been EMBOSSised are DRAWTREE and
DRAWGRAM.

..d

On Mon, 27 Aug 2001, Jennifer Steinbachs wrote:

>
> I like to use Seaview to make the conversion (I don't have the website
> handy but a google search should produce it quickly).  ClustalX (and maybe
> clustalw) also produce Phylip files.  I thought the Phylip website had
> information on the format, but it's been a while since I've actually
> perused the documentation.  The phylip docs should definitely have
> complete information.
>
> If I recall correctly, it is something like:
>
> #sequences #nucleotide_sites
> sequence_name sequence
> sequence_name sequence
>
> etc.
>
> There used to be a 10 character limit on sequence_name, but I don't know
> if that holds with the latest version - I use PAUP* mostly for my
> analyses.  Sequence can be non-interleaved or interleaved.
>
> -jennifer
>
> On Mon, 27 Aug 2001, Gene Cutler wrote:
>
>  >Thanks Jennifer.  One more question:
>  >
>  >>Put your aligned sequences into PHYLIP format
>  >
>  >I didn't see any information on the phylip webpage about phylip format
>  >and/or conversion tools.  I can do the conversion myself if I can find
>  >documentation on the format.
>  >
>  >>If you aren't certain of the differences between the different
>  >>tree-building algorithms, you should get your hands on Nei and Kumar 2000
>  >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
>  >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
>  >>ISBN 0-87893-282-8 Chapter 11.
>  >
>  >That's useful too.  Thanks again.
>  >
>  >
>  >
>
>

----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------


From pscotney at hotmail.com  Sat Aug 25 09:55:16 2001
From: pscotney at hotmail.com (Pierre Scotney)
Date: Sat, 25 Aug 2001 23:55:16 +1000
Subject: [EMBOSS] EMBOSS and Jemboss installation problems SOLVED!
Message-ID: <F9sn2Yf0scL9NtDEJmG0000037e@hotmail.com>


Hello!

I have solved the GNU/Linux EMBOSS and Jemboss installation problems :)

The solution was:

1) edit both /etc/profile (for bash) and /etc/csh.login (for csh) so that 
$PATH includes the /usr/local/lib/j2sdk1.4.2/bin path.  Previously only bash
had the correct path to the java binaries.

2) use j2sdk1.4.2, as EMBOSS-2.8.0 will not build with j2sdk1.3.1 (Blackdown 
Java-Linux).  Maybe the documentation/scripts will need to be changed to 
reflect this issue.

Cheers

Pierre

--
Dr Pierre Scotney
Melbourne
Australia



From simon.andrews at bbsrc.ac.uk  Wed Aug  1 12:56:08 2001
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Wed, 1 Aug 2001 13:56:08 +0100
Subject: [EMBOSS] Getting headers from Seqret
Message-ID: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk>

[sent to Emboss mailing list]

Dear All,

I'm having trouble getting header information back through seqret, from a
database formatted using dbiflat against a genbank flat file (refseq
actually).  I'm sure plenty of people must have done this before, but I've
read through the documentation, and I can't see where I'm going wrong!

The database formatted OK, and I can fetch sequences back from it, but at
some point I will need to retrieve the entire header from the original file
to get at some of the extra information in there (feature tables, cross
references, authors etc).  I've tried several different output USAs with
seqret, but the most I can seem to get back is the name, accession number
and description.

I can't believe that this information is thrown away by seqret (it's still
there in the flat file after all), so how can I retrieve it?

	Thanks for any help

	Simon

[Potentially useful details follow]

----
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463 


##########################################################################


	Emboss version = 2.0.0

	Platform = DEC alpha (OSF1 v4.0)


My emboss.default entry for the database looks like;

	DB refseq [
	        type: N
	        method: emblcd
	        format: gb
	        dir: /usr/users/andrewss/Refseq/Genbank
	        file: "*.gbff"
	        release: "1.0"
	        comment: "Refseq Hum Mus Rat"
	]

and an example of the output of seqret with a debug USA is (with the
documentation space suspiciously blank!);

Sequence output trace
=====================

  Name: 'NM_031360'
  Accession: 'NM_031360'
  Description: 'Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.'
  Type: 'N'
  Database: 'refseq'
  Full name: ''
  Date: ''
  Usa: 'debug::test.seq'
  Ufo: ''
  Input format: 'gb'
  Output format: 'debug'
  Filename: 'test.seq'
  Entryname: 'NM_031360'
  File name: 'test.seq'
  Extension: 'fasta'
  Single: 'No'
  Features: 'No'
  Count: 'No'
  Documentation:...

    1  atgaagcaca acttttctct gcggctgagg gttttcaacc tcaactgctg    50
   51  ggacatcccc tacctaagca agcatagggc cgaccgcatg aagcgcttgg   100 

       etc.


The extra stuff I'm after is this sort of thing;

LOCUS       NM_031360    1269 bp    mRNA            ROD       12-JUN-2001
DEFINITION  Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.
ACCESSION   NM_031360
VERSION     NM_031360.1  GI:14389300
KEYWORDS    .
SOURCE      Norway rat.
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
            Rattus.
REFERENCE   1  (sites)
  AUTHORS   Mizutani,Y., Tamiya-Koizumi,K., Irie,F., Hirabayashi,Y., Miwa,M.
            and Yoshida,S.
  TITLE     Cloning and expression of rat neutral sphingomyelinase:
            enzymological characterization and identification of essential
            histidine residues
  JOURNAL   Biochim. Biophys. Acta 1485 (2-3), 236-246 (2000)
  MEDLINE   20292884
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AB047002.1.
FEATURES             Location/Qualifiers
     source          1..1269
                     /organism="Rattus norvegicus"
                     /strain="Sprague-Dawley"
                     /db_xref="taxon:10116"
                     /chromosome="X"
                     /chromosome="14"
                     /chromosome="2"
                     /chromosome="3"
                     /chromosome="17"
                     /map="Xq28"
                     /map="14q"
                     /map="2 36.0 cM"
                     /map="Xq11.1"
                     /map="3"
                     /map="17q12-q21"
                     /sex="male"
                     /tissue_type="liver"
                     /clone_lib="rat liver lambda cDNA library
                     (STRATAGENE,#936513)"
     gene            1..1269
                     /gene="Smpd2"
                     /note="EBS3; EBS4; K14; CK; MAGE5; MAGE10; Tdo; Araf"
                     /db_xref="LocusID:83537"
                     /db_xref="MGD:MGI:98246"
                     /db_xref="MIM:148066"
                     /db_xref="MIM:300340"
                     /db_xref="MIM:300343"
                     /db_xref="MIM:601443"
                     /db_xref="RATMAP:36372"
                     /db_xref="RGD:36372"
     CDS             1..1269
                     /gene="Smpd2"
                     /note="lyso-platelet activating factor-phospholipase C;
                     cytokeratin 14; Raf related protein;
                     Synaptosomal-associated protein"
                     /codon_start=1
                     /db_xref="LocusID:83537"
                     /db_xref="MGD:MGI:98246"
                     /db_xref="MIM:148066"
                     /db_xref="MIM:300340"
                     /db_xref="MIM:300343"
                     /db_xref="MIM:601443"
                     /db_xref="RATMAP:36372"
                     /db_xref="RGD:36372"
                     /product="neutral sphingomyelinase"
                     /protein_id="NP_112650.1"
                     /db_xref="GI:14389301"
                     /translation="MKHNFSLRLRVFNLNCWDIPYLSKHRADRMKRLGDFLNLESFDL
                     ALLEEVWSEQDFQYLKQKLSLTYPDAHYFRSGIIGSGLCVFSRHPIQEIVQHVYTLNG
                     YPYKFYHGDWFCGKAVGLLVLHLSGLVLNAYVTHLHAEYSRQKDIYFAHRVAQAWELA
                     QFIHHTSKKANVVLLCGDLNMHPKDLGCCLLKEWTGLRDAFVETEDFKGSEDGCTMVP
                     KNCYVSQQDLGPFPFGVRIDYVLYKAVSGFHICCKTLKTTTGCDPHNGTPFSDHEALM
                     ATLCVKHSPPQEDPCSAHGSAERSALISALREARTELGRGIAQARWWAALFGYVMILG
                     LSLLVLLCVLAAGEEAREVAIMLWTPSVGLVLGAGAVYLFHKQEAKSLCRAQAEIQHV
                     LTRTTETQDLGSEPHPTHCRQQEADRAEEK"
     misc_feature    91..837
                     /note="AP_endonucleas1; Region: AP endonuclease family 1"


From peter.rice at uk.lionbioscience.com  Wed Aug  1 13:12:57 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Wed, 01 Aug 2001 14:12:57 +0100
Subject: [EMBOSS] Getting headers from Seqret
References: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk>
Message-ID: <3B680059.74B4594@uk.lionbioscience.com>

"simon andrews (BI)" wrote:
> The database formatted OK, and I can fetch sequences back from it, but
> at some point I will need to retrieve the entire header from the
> original file to get at some of the extra information in there
> (feature tables, cross references, authors etc).
>
> I've tried several different output USAs with
> seqret, but the most I can seem to get back is the name, accession number
> and description.

It all depends on how much information we store in the internal data
structures. As standard, we keep the ID, Accession, Description and
sequence so we can write a FASTA format file easily.

We also keep the complete feature table, but only optionally. seqret
ignores it, but seqretallfeat reads and writes it. Most programs only need
the sequence data and parsing feature information wastes time and space on
large sequences.

We can also read the entire text of an entry with entret, assuming you want
the original flatfile format.

>I can't believe that this information is thrown away by seqret
> (it's still there in the flat file after all),

Yes, it is (but we can easily read more fields - the problem is whether we
can convert them to other file formats easily)

> so how can I retrieve it?

Using entret - which sounds like the solution you need.
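
A minimal sketch of driving entret from a script. The helper names, the
database name "refseq" in the usage note, and the exact qualifier spelling
(-sequence, -outfile, -auto) are assumptions for illustration, not something
this thread confirms; check them against your installed version.

```python
import subprocess


def entret_command(usa, outfile):
    """Build an entret command line for one USA (hypothetical helper)."""
    # Assumed qualifiers: -sequence names the entry, -outfile the destination,
    # and -auto suppresses interactive prompts.
    return ["entret", "-sequence", usa, "-outfile", outfile, "-auto"]


def fetch_entry(usa, outfile):
    """Run entret and return the flat-file text it wrote (needs EMBOSS installed)."""
    subprocess.run(entret_command(usa, outfile), check=True)
    with open(outfile) as handle:
        return handle.read()
```

With a dbiflat-indexed database defined as "refseq", a call such as
fetch_entry("refseq:NM_031360", "NM_031360.entret") would be expected to
return the whole entry, header and feature table included.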

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From ableasby at hgmp.mrc.ac.uk  Wed Aug  1 17:15:25 2001
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Wed, 1 Aug 2001 18:15:25 +0100 (BST)
Subject: EMBOSS patchfiles directory
Message-ID: <200108011715.SAA26106@bromine.hgmp.mrc.ac.uk>

Just a reminder that, between EMBOSS releases, occasional bugfixes
are placed in the directory:

  ftp://ftp.uk.embnet.org/pub/EMBOSS/patchfiles/

There are currently two replacement files in that directory.

    marscan.c
    showfeat.c

Both are replacements for applications in the EMBOSS-2.0.1/emboss
directory.

Alan


From gbottu at ben.vub.ac.be  Thu Aug  2 17:00:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 2 Aug 2001 19:00:02 +0200 (MET DST)
Subject: databanks in PIR format
Message-ID: <200108021700.TAA24275@bigben.vub.ac.be>

from : BEN

	Dear colleagues,
	
Has anybody already successfully accessed databanks in PIR NBRF or CODATA format 
under EMBOSS ? 

I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is 
indexed under SRS). My emboss.default file contains :

DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant'
            methodquery: srs dbalias: PIR_NR
            methodall: direct dir: /seq/protein/flat file: pir_nr.seq
]

But this does not work. E.g.  seqret pir_nr:e69549  gives an output file :

>E69549 conserved hypothetical protein AF2396 - Archaeoglobus fulgidus
>E69549
MTVVPLSALREGQEGRVVAINGGRGCTARLMSMGIVPGKKIRIAGRRGGAVLVSVNGTKF
VIGRGLAMKVAVDVGEQG

	Guy Bottu


From peter.rice at uk.lionbioscience.com  Thu Aug  2 17:28:29 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Thu, 02 Aug 2001 18:28:29 +0100
Subject: databanks in PIR format
References: <200108021700.TAA24275@bigben.vub.ac.be>
Message-ID: <3B698DBD.9984ED23@uk.lionbioscience.com>

Guy Bottu wrote:
> I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is
> indexed under SRS). My emboss.default file contains :
> 
> DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant'
>             methodquery: srs dbalias: PIR_NR
>             methodall: direct dir: /seq/protein/flat file: pir_nr.seq
> ]
> 
> But this does not work. E.g.  seqret pir_nr:e69549  gives an output file 

This is because of problems in SRS converting PIR entries to PIR format.
This has been the same since the days of SRS 5, but I have passed it on to
the support guys here to take a look. Seems nobody has been retrieving PIR
entries in their original format.

For example, see PIR on the SRS 5 server at MIPS:

http://srs-mips.gsf.de/srs5bin/cgi-bin/wgetz?-id+2trYB1GreRI+-e+[PIR-ID:'E69549']

You can get queries to work with:

DB pir_nr [ type: P format: fasta comment: 'PIR nonredundant'
            methodquery: srsfasta dbalias: PIR_NR
            methodall: direct dir: /seq/protein/flat file: pir_nr.seq
]

... but the fasta format required for srsfasta will not let you work with
direct access to all entries.

srs access does getz -e

srsfasta access does getz -d -sf fasta

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From peter.rice at uk.lionbioscience.com  Thu Aug  2 17:59:20 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Thu, 02 Aug 2001 18:59:20 +0100
Subject: databanks in PIR format
References: <200108021700.TAA24275@bigben.vub.ac.be> <3B698DBD.9984ED23@uk.lionbioscience.com>
Message-ID: <3B6994F8.F4A5A403@uk.lionbioscience.com>

>This is because of problems in SRS converting PIR entries to PIR format.
>This has been the same since the days of SRS 5, but I have passed it on to
>the support guys here to take a look.

Quick fix would be to change the format in pir.i to be "plain" and run
srssection.

This gives PIR format without the trailing * but is good enough to make
EMBOSS happy. Then Guy's original definition should work.

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From gbottu at ben.vub.ac.be  Fri Aug  3 09:43:57 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 11:43:57 +0200 (MET DST)
Subject: databanks in PIR format
Message-ID: <200108030943.LAA11981@bigben.vub.ac.be>


>Quick fix would be to change the format in pir.i to be "plain" and run
>srssection.
>
>This gives PIR format without the trailing * but is good enough to make
>EMBOSS happy. Then Guy's original definition should work.
>

I tried and it worked ! Thanks for the advice.

Still, there must be some nasty bug hidden in the SRS code, since a similar
problem does not occur with EMBL and GenBank formats. Let's hope they can fix
it.

	Guy Bottu


From gbottu at ben.vub.ac.be  Fri Aug  3 12:48:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 14:48:02 +0200 (MET DST)
Subject: problem with remote databank access
Message-ID: <200108031248.OAA26689@bigben.vub.ac.be>

from : BEN

	Dear support,
	
While experimenting with remote databank access I noticed the following :

DB GENBANK [ type: N format: genbank method: url
             comment: 'GenBank at Institut Pasteur (Paris, France)'
             url: "http://srs.pasteur.fr/cgi-bin/srs6/wgetz?-e+[genbank-acc:%s]"
]

does work fine. However, with :

DB GENBANK [ type: N format: genbank method: url
             comment: 'GenBank at DKFZ (Heidelberg, Germany)'
             url: "http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+[genbank-acc:%s]"
]

seqret genbank:X15320 retrieves a file :

>ECARGS X15320 Escherichia coli argS gene for arginyl-tRNA-synthetase (EC 6.1.1.19

The problem is probably that at the DKFZ they index the databank in GCG format. 
However, replacing "format: genbank" by "format: gcg" does not work.

	Guy Bottu


From jackl at dalicon.com  Fri Aug  3 13:11:41 2001
From: jackl at dalicon.com (Jack Leunissen)
Date: Fri, 3 Aug 2001 15:11:41 +0200
Subject: problem with remote databank access
References: <200108031248.OAA26689@bigben.vub.ac.be>
Message-ID: <009001c11c1d$d74aaff0$0400a8c0@cmbipc32>

No, the problem is that their default output format is EMBL! And that seems
to upset EMBOSS, as it expects GENBANK format for the sequence information
too.

Changing the call to:
url:"http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+-sf+genbank+[genbank-acc:%s]"
does the trick! (Note the addition "+-sf+genbank" to force the sequence
output in GENBANK format.)
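
To make the role of the %s placeholder concrete, here is a small sketch that
fills the template by hand. fill_url is a hypothetical helper; how EMBOSS
performs the substitution internally is not shown in this thread.

```python
from urllib.parse import quote

# Template from the message above; "%s" marks where the accession goes.
DKFZ_URL = ("http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/"
            "wgetz?-e+-sf+genbank+[genbank-acc:%s]")


def fill_url(template, accession):
    """Substitute one accession for the single %s placeholder."""
    if template.count("%s") != 1:
        raise ValueError("template must contain exactly one %s")
    # Percent-encode the accession so stray characters cannot alter the query.
    return template.replace("%s", quote(accession, safe=""))


print(fill_url(DKFZ_URL, "X15320"))
```

Fetching the printed URL by hand is a quick way to check which format a
remote SRS server returns before putting the definition in emboss.default.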

Cheers,
Jack

   Jack A.M. Leunissen         Email: jackl at cmbi.kun.nl
   Centre for Molecular and    Tel  :  +31 24 365 22 48
   Biomolecular Informatics    Fax  :  +31 24 365 29 77
   Nijmegen, Netherlands       http://www.cmbi.kun.nl/




From peter.rice at uk.lionbioscience.com  Fri Aug  3 15:07:39 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Fri, 03 Aug 2001 16:07:39 +0100
Subject: databanks in PIR format
References: <200108030943.LAA11981@bigben.vub.ac.be>
Message-ID: <3B6ABE3B.5EBE5C2C@uk.lionbioscience.com>

Guy Bottu wrote:
> >Quick fix would be to change the format in pir.i to be "plain" and run
> >srssection.
>
> I tried and it worked ! Thanks for the advice.
> 
> Still, there must be some nasty bug hidden in the SRS code, since similar
> problem does not occur with EMBL and GenBank formats. Let's hope they
> can fix it.

"It's not a bug, it's a feature"

As it has been there since SRS 5.0 (at least) and requires changes to the C
source code (so that PIR format behaves the same way as EMBL), it will have
to wait for a future release.

Meanwhile, the plain fix will work well enough - some software may want a
trailing '*' but probably most programs will be happy.

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From dalke at dalkescientific.com  Mon Aug  6 00:52:59 2001
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 6 Aug 2001 01:52:59 +0100
Subject: questions about ACD format
Message-ID: <005101c11e12$470a8180$0201a8c0@josiah.dalkescientific.com>

[Brief summary: I'm trying to integrate Emboss with Biopython and
found that 1) not enough sequence type information is available
in the ACD file for Biopython's AlphabetStrict code to work, so
I have a proposal to fix that, 2) I have questions about how to
interpret some of the documentation, 3) there are places where
the Emboss ACD parser doesn't appear to work correctly, and 4)
general observations on the ACD format and on the implementation.]

Hello,

First off, my apologies if this is the wrong email address for
this topic.  I couldn't find any archives to scan for verification.
I am also not a member of this list, so please cc me on any
replies.

Based on the feedback I got from some people at ISMB, I've started a
Python interface to EMBOSS.  The goal is to be able to do something
like:

>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Emboss import apps
>>>
>>> seq = Seq.Seq("AATCCATCGATGCAC", IUPAC.unambiguous_dna)
>>> results = apps.revseq(sequence = seq)
>>> results["outseq"]
Emboss.EmbossSeq("GTGCATCGATGGATT", IUPAC.ambiguous_dna)
>>>

I can almost, but not quite, do this, for some reasons I'll describe
shortly.  Here are the questions and problems I had in doing this,
as well as some specific features I would like to see added, which
I feel may make it easier to integrate EMBOSS with other systems.

======

** Topic 1

As you can see in the above example, there is some automatic
conversion going on.  One is to convert the Biopython 'Seq' object
to a temporary file, so it can be used with the '-sequence'
parameter needed by revseq.  This is done by knowing how to convert
the Seq object to a 'seqall' Emboss type, including looking at the
'type' field to ensure that the input sequence is really DNA.

The conversion step requires that I do a verification of the Biopython
Seq Alphabet to the Emboss sequence 'type'.  There is a description
of the types in the syntax document, at
  http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html
but it doesn't describe:
  1) what is used as a gap character? (I assume '-')
  2) what is used for a stop character? (I assume '*')
  3) are selenocysteines encoded with a U?  (the pureprotein definition
      says it excludes "BZ or X", so I'm guessing selenocysteines aren't
      allowed - or are they encoded as X?)
  4) shouldn't there be a gapstopprotein?

** Topic 2

Another conversion is to create a temporary filename for the -outseq
parameter, based on the 'seqoutall' Emboss type.  I would like to read
the contents of this file into a Biopython Seq object, however, the
ACD description does not contain enough information for me to do that.
Instead, I can only create the tempfile and store the filename in
the "outseq" parameter.

Could a new 'type' parameter be added to 'seqoutall'?  This would
change revseq's "outseq" definition to be

seqoutall: outseq  [
  parameter: "Y"
  type: "dna"
  extension: "rev"
]

For applications like 'notseq' this would require using an operation:

seqoutall: outseq [
  param: Y
  type: "@($(sequence.protein)? protein : nucleic)"
]

The goal of this is to let researchers use EMBOSS from Python
without having to worry about an implementation detail - the
existence of a file system.

(BTW, I have heard there may be support in 3.0 for XML output
and the ability for all the output to be streamed to stdout.
I didn't find any details about this on my scan of the web
pages - what is the status and plans for this?)

** Topic 3:

The Emboss sequence data type contains the calculated attributes of
'protein', 'nucleic' and 'type'.  Is it that:
  - protein is true when the sequence type is
      'protein', 'gapprotein', 'pureprotein' or 'stopprotein'
  - nucleic is true when the sequence type is
      'dna', 'rna', 'puredna', 'purerna', 'nucleotide', 'purenucleotide'
      'gapnucleotide', 'gapdna', 'gaprna',
  - protein and nucleic are false for any other case
  ?
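
Stated as code, the guess in this topic looks like the sketch below. The two
sets are exactly the lists above; whether EMBOSS really computes the
attributes this way is the open question, so treat it as a hypothesis.

```python
# Sequence types guessed to set the calculated 'protein' attribute.
PROTEIN_TYPES = {"protein", "gapprotein", "pureprotein", "stopprotein"}

# Sequence types guessed to set the calculated 'nucleic' attribute.
NUCLEIC_TYPES = {
    "dna", "rna", "puredna", "purerna", "nucleotide", "purenucleotide",
    "gapnucleotide", "gapdna", "gaprna",
}


def calculated_attributes(seq_type):
    """Return the guessed 'protein' and 'nucleic' flags for a sequence type."""
    return {
        "protein": seq_type in PROTEIN_TYPES,
        "nucleic": seq_type in NUCLEIC_TYPES,
    }
```

Under this reading, a type name on neither list leaves both flags false.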

** Topic 4:

How do I force a sequence type?  The -sprotein and -snucleotide
command-line qualifiers are only boolean values, so there doesn't seem
to be any way to say an input is really a pureprotein.  Eg, there
could be a '-stype' qualifier, so I can do '-stype pureprotein'.

** Topic 5:

Given the existence of sequence.type, shouldn't most operations of the
form
    "@($(sequence.protein)? protein : nucleic)"
really be
    "$(sequence.type)"
?

This should allow better propagation of proper type information
through Emboss.

==

Okay, that's the sequence type related topic.  Now for some others,
first on parsing ACD files.  To get the parameter information I read
the ACD files.  There are actually two possible files to read: the
".acd" file and the file produced from the "-acdpretty" option.


** Topic 6:

Which is the preferred mechanism for getting ACD configuration
information?  There are advantages and disadvantages to either one.

  - The .acd file does not require executing a possibly arbitrary
  program to get its parameter information.  This can be a subtle
  security problem because the mechanism I'm using just does a
  system() call to see if the program exists, and has no qualms in
  running "rm -rf / && echo", which expands to the valid command
  "rm -rf / && echo -acdpretty".  By checking the acd file first,
  it eliminates that possibility, although it does require that
  the directory containing the .acd definitions be well-known.
  Is this well-known directory $EMBOSS_ACDROOT or is that a 1.x
  location?

  (The other possibility is to require that all Emboss executables and
  only Emboss executables be in a well-defined directory.  Looking at
  the standard 'configure', this is not usually the case - they get put
  into /usr/local/bin )

  - a problem with using the .acd file is that it may be out of sync
  with the actual executable

  - the -acdpretty option is problematical in that it writes its
  information to a file in the local directory.  My Python code cannot
  guarantee that the local directory is writeable, so I need to mkdir
  a temp directory then "cd $(tmpdir) && $(program) -acdpretty" then
  read "$(tmpdir)/$(program).acdpretty" then remove the directory.  It
  would be so much easier if -acdpretty option could write to stdout.
  (Eg, as when used as '-acdpretty -stdout')

  - the .acd file may use abbreviated names.  For example, it may have
  a qualifier as "param" instead of "parameter".  So the -acdpretty
  text is easier to parse.
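
The temp-directory workaround described above can be sketched as follows.
read_acdpretty and SAFE_NAME are hypothetical names; the only behaviour
assumed of EMBOSS is what is described here, namely that "-acdpretty" writes
"<program>.acdpretty" into the current directory.

```python
import os
import re
import subprocess
import tempfile

# Accept plain program names only, so "rm -rf / && echo" is rejected outright.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def read_acdpretty(program, run=subprocess.run):
    """Run `program -acdpretty` in a scratch directory and return the text."""
    if not SAFE_NAME.fullmatch(program):
        raise ValueError(f"refusing suspicious program name: {program!r}")
    with tempfile.TemporaryDirectory() as tmpdir:
        # An argument list (no shell) means "&&" or ";" is never interpreted.
        run([program, "-acdpretty"], cwd=tmpdir, check=True)
        path = os.path.join(tmpdir, program + ".acdpretty")
        with open(path) as handle:
            return handle.read()
```

The scratch directory is removed when the with-block exits, so nothing is
left behind even if the caller's working directory is read-only.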

I would prefer getting the ACD data directly from the executable.  Is
it possible to allow -stdout as an option to -acdpretty to make it
dump to stdout?  The other issues I can work around.

** Topic 7:

The ACD syntax definition is incomplete.  Here are some problems I ran
across.

> Comments start with "#" and continue to the end of the line.

Must the '#' be in the first character position?

The function ajacd.c:acdNoComment looks like it truncates the line at
the first '#', no matter where it is in the string, so the '#' doesn't
need to be the first character.

On the other hand, it looks like that bit of code doesn't understand
quoted strings.  Consider

% cat foo.acd
appl: foo [
        doc: "Who is #1?"
        groups: "Edit"
]

% ../acdc foo
Who is groups: "Edit
%
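
For comparison, a quote-aware version of the comment stripping might look
like this sketch. It is not the ajacd.c code; it handles only double quotes
and assumes a quoted string never spans lines.

```python
def strip_comment(line):
    """Drop a trailing '#' comment, ignoring '#' inside double quotes."""
    in_quotes = False
    for pos, char in enumerate(line):
        if char == '"':
            in_quotes = not in_quotes      # entering or leaving a quoted string
        elif char == "#" and not in_quotes:
            return line[:pos]              # comment starts here
    return line
```

With this rule the doc line in foo.acd above survives intact, because its
'#' sits between the two double quotes.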

> Each line is parsed into tokens delimited by spaces

What is the definition of a token?  We also have that

> Parameters and qualifiers are defined by a single token followed by
> either a colon ':' (preferred) [1] or an equal sign '=' which in
> turn is followed by a second token.

This means a token cannot end in a ':' or a '='.  But it can contain a
':' outside of quotes, as in

   opt: @($(showall)?N:Y)

Or consider

% cat foo.acd
appl: foo [
        doc: A:
]

% ../acdc foo
A:
%

This means the ':' is not part of the first token in a
parameter/qualifier but is part of the second token.

Spaces aren't really the token delimiter.  The file 'wordcount.acd'
contains

  sequence: sequence [ param: Y type: dna]

so the token 'dna' is not space delimited before the ']'.  Also,
checktrans.acd uses 'min:1' which is not space delimited.


I'm trying to figure out how ajacd.c does it, but I'm getting lost in
the code.

To make things even more confusing

% cat foo.acd
appl: f"oo [
        doc:A]B
]


% ../acdc foo
A]B
%

Also, the term 'space' in the documentation should be 'whitespace'
since it can skip '\t' characters.  Hmm, and looking at the code,
there are problems with how it skips the ':' characters.

% cat foo.acd
appl:::: foo [
        doc: "This is the doc."
]

% ../acdc foo
This is the doc.
%

And using a NUL character
% od -c foo.acd
0000000   a   p   p   l   :       f   o   o       [  \n
0000020                   d   o   c  \0   :       "   H   a   s       a
0000040       N   U   L       c   h   a   r   a   c   t   e   r   "  \n
0000060                                   S   t   r   a   n   g   e  \n
0000100   ]  \n
0000102
% ../acdc foo
Strange
%

So the parser code does not fully validate that the input data is in
the correct format.

> After the name, definitions are in mandatory square brackets, [],
> which can make a definition span multiple lines.

seqretallfeat.acd contains the following two lines

  endsection: secoutseq
  endsection: secinseq

which don't have the [].  My parser ends up special casing the
'endsection' declarations.  Would it be possible to use, say,

  endsection: secoutseq []

instead?  (Also, section and endsection are not defined anywhere in
that syntax document.)

> Tokens representing data types can be abbreviated up to the point
> where they are not ambiguous

That's a VMS-help-style shortcut.  As I recall, that has a
forward-compatibility problem.  For example, if a new data type called
'apple' is added, then 'a', 'ap', 'app', and 'appl' are no longer
unambiguous.

Has there been any consideration on how to deal with that?
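
The forward-compatibility problem is easy to reproduce with a sketch of
prefix matching. The list of type names here is a made-up subset, and
resolve is a hypothetical helper, not the EMBOSS lookup.

```python
def resolve(abbrev, names):
    """Expand an abbreviation to the one name it is a prefix of."""
    matches = [name for name in names if name.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown name: {abbrev!r}")
    raise ValueError(f"ambiguous {abbrev!r}: could be {sorted(matches)}")


today = ["application", "integer", "sequence"]
print(resolve("appl", today))              # one candidate, so this resolves

tomorrow = today + ["apple"]               # a new data type is added later
try:
    resolve("appl", tomorrow)
except ValueError as err:
    print(err)                             # the old abbreviation now fails
```

Every existing file that uses the shorter spelling breaks the day the new
name appears, which is the concern raised above.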

> Values can be delimited (i.e. treated as one token) by any of the
> following pairs, which are stripped as the value is parsed :
>
>      '' {} () [] <>

It's not clear what a "value" means?  In this section there is

token: token [
  definition
]

But later on the word 'attributes' is used instead of
'definition':

data_type: parameter_name [
    attributes
]

and only then does it say what a value is:

> A defining attribute must have a second token representing the value
> of the attribute.

So perhaps there should be some cleanup of the definition.  (The
reason I needed to figure this out was to check that

appl: foo [
  "multiword attribute": N,
]

was indeed supposed to be illegal.)

There doesn't appear to be any way to escape a quote character inside
of a quoted token.  At least, not that I could see in the code.  So there's
no way to write something like

appl: foo [
        doc: "Remove the characters ""{}<>()'"
]

for the string
  Remove the characters "{}<>()'

Also, the doc says the valid characters are
    '' {} () [] <>
but that should include "double quotes"

And just why are there so many quote characters?

** Topic 8:

It took me a while to figure out that ajacd.c did the ACD parsing.
The file ajnam.c parses the .embossrc and emboss.default files, which are
described in
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Usa/databases.html and are
*almost* in ACD format.  The difference is that they don't have an
'application' term and the 'DB' needs to be 'DB:'.

I can tweak my parser to handle the 'DB' term, but why can't those two
files really be in ACD format?  ... Although implementation-wise the
ajacd.c file uses static variables so it can only be used to parse one
file.

I noticed a couple of problems with how ajnam.c works.

   - It only understands a comment as a '#' in the first character
   position (while ajacd.c recognizes it anywhere).

   - The code uses "fgets(line, 512, file)" which looks like it can
   fail if the line is more than 511 characters, as with long file
   names.

(Actually, since this is a completely different implementation of the
parser, the failure conditions are different.  For example, there is a
namNoColon in ajnam.c, but nothing to strip a '='.)

** Topic 9

There needs to be some clarification on the license.  When I looked at
the code I read the top-level "COPYING" file, which is the GPL.  I
have a policy not to look at GPL'ed code too closely, since I worry
that it may contaminate my ability to write equivalent non-GPL code,
like the BSDed Biopython code.  LGPL is not quite as bad, but even
then I write the non-FSF-licensed code first then if needed for
verification I look at the LGPL'ed code.

Since the top-level COPYING file is the GPL, that put me off looking
at any of the source code, even for verifying format requirements.  It
had to be pointed out to me that the ajax and nucleus codes are
covered under the LGPL.  I would not have discovered that on my own,
because the multiple license use wasn't mentioned in the README.

In addition, I noticed the ./LICENSE is slightly different than the
current Version 2, June 1991 one from the FSF.  The FSF address is
wrong, and there are formatting changes.  I cannot tell if there are
any text changes.

I also noticed ./COPYING file is the GPL, except for a change in the
address and the exclusion of the section "How to Apply These Terms to
Your New Programs"

Shouldn't these be identical files, and match the current FSF GPL?


** Topic 10

What does the 'warnrange' attribute of an integer do?  (I've only
lightly scanned the table of data types so will likely have more
questions about the other fields in the future.)


** Topic 11

In scanning the code I noticed there is an indirection layer, which I
assume is to isolate the programmer from changes in the OS and C
library.  It isn't used everywhere.  For example, there's an
ajNamGetenv but several places call getenv directly.

I also did a scan looking for possible overflows and other security
problems.  Because of my inexperience with the indirection layer I
couldn't do an in-depth check, but I did notice that ajStrFromFloat
and ajStrFromDouble can fail on Inf, -Inf and NaN, for a couple of
reasons:

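% cat inf.c
#include <stdio.h>
#include <stdlib.h>   /* strtod, abs */
#include <math.h>     /* log10 */

int main(void) {
  /* float val = -1.0/0.0; */
  float val = strtod("-inf", NULL);
  char s[100];
  int precision = 0, ival, i;

  sprintf(s, "val == >>%.0f<<", val);
  puts(s);

  /* Converting -inf to int is undefined behaviour; the run below shows INT_MIN. */
  ival = abs((int) val);
  printf("ival = %d\n", ival);

  /* With a negative ival, log10 returns NaN and the size goes negative. */
  if (ival)
    i = precision + (int) log10((double)ival) + 4;
  else
    i = precision + 4;

  printf("i == %d\n", i);
  return 0;
}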
% cc inf.c -lm
% ./a.out
val == >>-inf<<
ival = -2147483648
i == -2147483644
%
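
One way to guard the size computation, sketched in Python rather than C. It
assumes the arithmetic in inf.c reflects what the library functions do;
float_field_width is a hypothetical name.

```python
import math


def float_field_width(value, precision):
    """Characters to reserve when formatting value with the given precision."""
    if not math.isfinite(value):
        # "inf", "-inf" and "nan" need at most four characters, plus a terminator.
        return 5
    ival = abs(int(value))
    extra_digits = len(str(ival)) - 1   # equals int(log10(ival)) when ival > 0
    return precision + extra_digits + 4
```

Checking for non-finite values first keeps the integer conversion and the
logarithm away from inputs they cannot handle.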

** Topic 12

Here's my first pass of the BNF for the ACD file.  There are various
things to fix, some of which are noted.  This can be used for every
file in the emboss/acd directory except qatest.acd (which contains a
syntax error that acdc doesn't catch -- the "int bint" field) and
testplot.acd (contains an '=' instead of a ':', which I don't yet
handle).

Lexer:
  colon = ":"
  open_block = "\["
  close_block = "\]"
  endsection = "endsection"
  key = "(?!endsection)[a-zA-Z0-9_]+(?=[\s:\][])"
  value = "[a-zA-Z0-9_]+(?![\s:\][])[^\s\]]* |
           [^\000-\037a-zA-Z0-9_:[\]\s][^\s\]]*"
  quoted = '"[^"]*"'   (only handles double quotes - need to fix)
  comment = "[#][^\n\r]*(\r|\r?\n)" SKIPPED
  whitespace = "\s+" SKIPPED

Parser:  (need to update the names to match the syntax doc)
  application ::= widget_list

  widget_list ::= widget |
                  widget widget_list

  widget ::= key colon key open_block arglist close_block
           | key colon key key
           | key colon key value
           | endsection colon key

  arglist ::= arg
           |  arg arglist

  arg ::= key colon key
        | key colon value
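
As a rough illustration of the grammar, here is a small Python sketch that
tokenizes and parses the same shape of input. It uses a deliberately
simplified lexer (not the regular expressions above), handles only double
quotes and ':' separators, and all function names are made up for the example.

```python
import re

# Simplified lexer: comments, double-quoted strings, brackets, and bare words.
TOKEN = re.compile(r'''
    (?P<comment>\#[^\n]*)
  | (?P<quoted>"[^"]*")
  | (?P<open>\[)
  | (?P<close>\])
  | (?P<word>[^\s\[\]]+)
''', re.VERBOSE)


def tokenize(text):
    """Return (kind, text) pairs, skipping whitespace and comments."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)
            if m.lastgroup != "comment"]


def read_pair(tokens, i):
    """Read 'key: value' (or 'key:value') at tokens[i]; return key, value, next index."""
    kind, text = tokens[i]
    key, colon, rest = text.partition(":")
    if kind != "word" or not colon:
        raise SyntaxError(f"expected 'key:' but found {text!r}")
    if rest:                                   # e.g. 'min:1', no space after ':'
        return key, rest, i + 1
    if i + 1 >= len(tokens) or tokens[i + 1][0] not in ("word", "quoted"):
        raise SyntaxError(f"missing value after {text!r}")
    vkind, value = tokens[i + 1]
    if vkind == "quoted":
        value = value[1:-1]                    # strip the surrounding quotes
    return key, value, i + 2


def parse(text):
    """Parse ACD-like text into a list of (data_type, name, attributes)."""
    tokens, widgets, i = tokenize(text), [], 0
    while i < len(tokens):
        dtype, name, i = read_pair(tokens, i)
        attrs = {}
        if i < len(tokens) and tokens[i][0] == "open":   # optional [ ... ] block
            i += 1
            while i < len(tokens) and tokens[i][0] != "close":
                key, value, i = read_pair(tokens, i)
                attrs[key] = value
            if i >= len(tokens):
                raise SyntaxError(f"missing ']' after {name!r}")
            i += 1
        widgets.append((dtype, name, attrs))
    return widgets
```

Because the bracket block is optional, a bare "endsection: secoutseq" needs
no special case here; abbreviated keys such as "param" are returned as
written rather than expanded.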

** Topic 13

One last thing.  The parameter information for the different ACD
data types is hard coded in ajacd.c.  If it was stored in an external
data file (in ACD format with well-defined fields :) then my Python
code could read that meta-information to build up its tables, rather
than me having to code it all by hand.


Hope this wasn't too much at once :)

    Andrew
    dalke at dalkescientific.com


From 962856211 at tay.ac.uk  Tue Aug  7 15:17:47 2001
From: 962856211 at tay.ac.uk (962856211 at tay.ac.uk)
Date: Tue, 7 Aug 2001 16:17:47 +0100
Subject: free downloads?
Message-ID: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>

The list of programs you have at
http://www.uk.embnet.org/Software/EMBOSS/Apps/

is it a list of freedownloads?
Barry Marshall BSC Hons

From gwilliam at hgmp.mrc.ac.uk  Tue Aug  7 15:27:39 2001
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Tue, 07 Aug 2001 16:27:39 +0100
Subject: free downloads?
References: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>
Message-ID: <3B7008EB.F4437A4F@hgmp.mrc.ac.uk>

> 962856211 at tay.ac.uk wrote:
> 
> The list of programs you have at
> http://www.uk.embnet.org/Software/EMBOSS/Apps/
> 
> is it a list of freedownloads?
> Barry Marshall BSC Hons


This is a list of the applications in the EMBOSS package.
The package can be downloaded for free (under the GPL licence) 

See:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From dmartin at bioinformatics.msiwtb.dundee.ac.uk  Tue Aug  7 16:22:31 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Tue, 7 Aug 2001 17:22:31 +0100 (BST)
Subject: free downloads?
In-Reply-To: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>
Message-ID: <Pine.LNX.4.33.0108071713200.23186-100000@bioinformatics.msiwtb.dundee.ac.uk>

On Tue, 7 Aug 2001 962856211 at tay.ac.uk wrote:

> The list of programs you have at
> http://www.uk.embnet.org/Software/EMBOSS/Apps/
>
> is it a list of freedownloads?

EMBOSS is a freely downloadable package licensed under the GPL/LGPL.

You will probably want a unix/linux system on which to install it. The
admin guide describes in excruciating detail how to do this (look in the
documentation section of the web site).

If you are in Dundee (at least your email address is) then drop by if you
have any questions.

..d


----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------


From cbonnard at isrec-sg1.unil.ch  Mon Aug 20 09:09:11 2001
From: cbonnard at isrec-sg1.unil.ch (Claude Bonnard)
Date: Mon, 20 Aug 2001 11:09:11 +0200
Subject: Database access for EMBOSS
Message-ID: <10108201109.ZM13075@isrec-sg1>

Hello,

It is not very surprising that SRS is the best mode for fast access to the
sequence databases from EMBOSS. As I understand it, the URL mode allows access
to a SINGLE sequence and does not support the "USA" standard (wildcard query)
as SRS mode does.

If that is the case, is there a solution when the SRS server is NOT on the same
machine, but on a machine dedicated to SRS? I have in mind an rsh type of
request, and I would like to know whether someone has experienced this type of
problem and could help me solve it.

Thanks a lot

Regards
Claude

-- 
Claude Bonnard Ph.D.
ISREC (Swiss Institute for Experimental Cancer Research)
Bioinformatics Group
Ch des Boveresses 155
CH-1066 Epalinges
Switzerland
phone: [41-21]-692-5891/-2236
  fax: [41-21]-652-6933


From peter.rice at uk.lionbioscience.com  Mon Aug 20 09:20:55 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 20 Aug 2001 10:20:55 +0100
Subject: Database access for EMBOSS
References: <10108201109.ZM13075@isrec-sg1>
Message-ID: <3B80D677.99A6DCC8@uk.lionbioscience.com>

Claude Bonnard wrote:
> It is not very surprising that SRS is the best mode for fast access
> to the sequence databases from EMBOSS. As I understand it, the URL mode
> allows access to a SINGLE sequence and does not support the
> "USA" standard (wildcard query) as SRS mode does.

True. We could add an "SRSREMOTE" access mode to extend queries. It would be
easy to program but perhaps of limited practical use.
 
> If that is the case, is there a solution when the SRS server is NOT
> on the same machine, but on a machine which is dedicated to SRS?
> I have in mind an rsh type of request, and I would like to know if
> someone has experienced this type of problem and could help me
> solve it.

SRS access mode allows you to define the name of the getz program.

How about an alternative name that is a script, and uses rsh to run a
remote getz and returns the results?

For example, if your script is called 'remotegetz' just add this to the
database definition:

app: remotegetz

(you can use the full path if needed)

Note: This was originally added because the Sanger Centre ran 2 versions of
SRS (5.1 and 6.0) and I needed to switch between them, but it has other
possible uses.
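
A wrapper of the kind suggested above could look roughly like the sketch
below. It is an illustration only, not something shipped with EMBOSS or SRS:
the host name "srs-host" and the remote path "/usr/local/srs/bin/getz" are
made-up placeholders, and it assumes rsh to that machine works without a
password prompt. It passes whatever arguments EMBOSS supplies straight
through to the remote getz and lets the output flow back unchanged.

```python
#!/usr/bin/env python3
"""remotegetz: hypothetical wrapper that runs getz on a dedicated SRS machine."""
import shlex
import subprocess
import sys

REMOTE_HOST = "srs-host"                 # placeholder: your SRS machine
REMOTE_GETZ = "/usr/local/srs/bin/getz"  # placeholder: getz path on that machine


def build_command(args, host=REMOTE_HOST, getz=REMOTE_GETZ):
    """Return the rsh argument list that runs getz remotely with the same arguments."""
    # rsh hands one command string to the remote shell, so every argument is
    # quoted to keep SRS query characters such as [ ] and * intact.
    remote = " ".join(shlex.quote(a) for a in [getz] + list(args))
    return ["rsh", host, remote]


def main(argv):
    # stdout and stderr are inherited, so the caller reads the remote output directly.
    return subprocess.call(build_command(argv[1:]))

# In a real script, finish with: sys.exit(main(sys.argv))
```

Saved as an executable file called remotegetz somewhere on the PATH, this is
what the "app: remotegetz" line would pick up. The same shape would work with
ssh in place of rsh.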


-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From dmartin at bioinformatics.msiwtb.dundee.ac.uk  Mon Aug 20 09:34:19 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Mon, 20 Aug 2001 10:34:19 +0100 (BST)
Subject: Database access for EMBOSS
In-Reply-To: <3B80D677.99A6DCC8@uk.lionbioscience.com>
Message-ID: <Pine.LNX.4.33.0108201030440.6257-100000@bioinformatics.msiwtb.dundee.ac.uk>

On Mon, 20 Aug 2001, Peter Rice wrote:

> Claude Bonnard wrote:
> > It is not very surprising that SRS is the best mode for fast access
> > to the sequence databases from EMBOSS. As I understand it, the URL mode
> > allows access to a SINGLE sequence and does not support the
> > "USA" standard (wildcard query) as SRS mode does.
>
> True. We could add an "SRSREMOTE" access mode to extend queries. It would be
> easy to program but perhaps of limited practical use.
>
> > If that is the case, is there a solution when the SRS server is NOT
> > on the same machine, but on a machine which is dedicated to SRS?
> > I have in mind an rsh type of request, and I would like to know if
> > someone has experienced this type of problem and could help me
> > solve it.
>
> SRS access mode allows you to define the name of the getz program.
>
> How about an alternative name that is a script, and uses rsh to run a
> remote getz and returns the results?
>
> For example, if your script is called 'remotegetz' just add this to the
> database definition:
>
> app: remotegetz
>
> (you can use the full path if needed)
>
> Note: This was originally added because the Sanger Centre ran 2 versions of
> SRS (5.1 and 6.0) and I needed to switch between them, but it has other
> possible uses.

This would then allow one to add whichever script one wanted, as long as it
could parse SRS-style arguments.

It doesn't have to be SRS, it just has to look like it. The potential is
there for wrapping an in-house RDBMS with such a script.

I'll add some more comments to the admin guide if Peter can send me
details of how EMBOSS calls the wgetz program (not being much of an srs
hacker myself).


..d


----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------


From peter.rice at uk.lionbioscience.com  Mon Aug 20 09:50:22 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 20 Aug 2001 10:50:22 +0100
Subject: Database access for EMBOSS
References: <Pine.LNX.4.33.0108201030440.6257-100000@bioinformatics.msiwtb.dundee.ac.uk>
Message-ID: <3B80DD5E.F61846E1@uk.lionbioscience.com>

David Martin wrote:
> 
> On Mon, 20 Aug 2001, Peter Rice wrote:
> > For example, if your script is called 'remotegetz' just add this to the
> > database definition:
> >
> > app: remotegetz
> 
> This would then allow one to add whichever script one wanted, as long
> as it could parse SRS-style arguments.
> 
> It doesn't have to be SRS, it just has to look like it. The potential
> is there for wrapping an in-house RDBMS with such a script.
> 
> I'll add some more comments to the admin guide if Peter can send me
> details of how EMBOSS calls the wgetz program (not being much of an srs
> hacker myself).

This is getz, not wgetz. It supports the full SRS query language because it
calls getz (or a user defined script) with an SRS query constructed from
the USA.

But there is also an access method in general for external applications.
You can use this to set up RDBMS calls - which anyway was the original
intention.

At present it picks up dbname:id or dbname:acc as the rest of the command
line, or puts the id/accession into a formatted string (if the application
definition includes %s), but can easily be adapted further.
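
The two cases described above can be sketched as follows. This only mirrors
the behaviour as described in this message and is not taken from the EMBOSS
source; the function name and the example application strings are invented
for illustration.

```python
def build_app_command(app, dbname, entry):
    """Build the command line for an external application (illustrative only).

    app    -- the application definition, optionally containing %s
    dbname -- the database name from the USA
    entry  -- the id or accession from the USA
    """
    if "%s" in app:
        # The id/accession is put into the formatted string.
        return app.replace("%s", entry, 1)
    # Otherwise dbname:id (or dbname:acc) becomes the rest of the command line.
    return f"{app} {dbname}:{entry}"


# Hypothetical application definitions:
print(build_app_command("fetchentry", "mydb", "X12345"))
print(build_app_command("fetchentry -id %s -raw", "mydb", "X12345"))
```

The first call yields "fetchentry mydb:X12345" and the second
"fetchentry -id X12345 -raw".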

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From cutler at tularik.com  Mon Aug 27 18:34:15 2001
From: cutler at tularik.com (Gene Cutler)
Date: Mon, 27 Aug 2001 11:34:15 -0700
Subject: drawing trees
Message-ID: <a05101006b7b041c61d1e@[192.168.50.41]>

Hello, all.  I have a question about phylogenetic-type trees for 
sequences.  I haven't quite figured out how to do this using 
emboss/phylip.  This is how I have been doing this with gcg:

perform multiple sequence alignment (generally hmmalign or clustalw)
convert file to msf format if not already (e.g., sreformat from hmmer package)
run gcg program distances on the msf file
run gcg program growtree on the distances file
end up with a postscript file

How would I do this with PHYLIP instead?

Thanks.


From stein at fieldmuseum.org  Mon Aug 27 19:20:00 2001
From: stein at fieldmuseum.org (Jennifer Steinbachs)
Date: Mon, 27 Aug 2001 14:20:00 -0500 (CDT)
Subject: drawing trees
In-Reply-To: <a05101006b7b041c61d1e@[192.168.50.41]>
Message-ID: <Pine.LNX.4.33.0108271412360.30944-100000@mail.fmnh.org>


Use your favourite alignment program...
Put your aligned sequences into PHYLIP format
Run the appropriate phylip program...
    distance-based methods:
	protdist (for proteins)
	dnadist (for dna)
    parsimony
	protpars
	dnapars
    likelihood
	dnaml or dnamlk
	protml

See the phylip website
for more info (http://evolution.genetics.washington.edu/phylip.html).

If you aren't certain of the differences between the different
tree-building algorithms, you should get your hands on Nei and Kumar 2000
Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
reference for phylogenetics is Hillis et al 1996 Molecular Systematics
ISBN 0-87893-282-8 Chapter 11.

-jennifer


On Mon, 27 Aug 2001, Gene Cutler wrote:

 >Hello, all.  I have a question about phylogenetic-type trees for
 >sequences.  I haven't quite figured out how to do this using
 >emboss/phylip.  This is how I have been doing this with gcg:
 >
 >perform multiple sequence alignment (generally hmmalign or clustalw)
 >convert file to msf format if not already (e.g., sreformat from hmmer package)
 >run gcg program distances on the msf file
 >run gcg program growtree on the distances file
 >end up with a postscript file
 >
 >How would I do this with PHYLIP instead?
 >
 >Thanks.
 >
 >


-----------------------------------
J. Steinbachs, Ph.D.
Computational Biologist
http://compbiology.org
-----------------------------------


From cutler at tularik.com  Mon Aug 27 20:19:56 2001
From: cutler at tularik.com (Gene Cutler)
Date: Mon, 27 Aug 2001 13:19:56 -0700
Subject: drawing trees
In-Reply-To: <Pine.LNX.4.33.0108271412360.30944-100000@mail.fmnh.org>
References: <Pine.LNX.4.33.0108271412360.30944-100000@mail.fmnh.org>
Message-ID: <a05101017b7b05a35d728@[192.168.50.41]>

Thanks Jennifer.  One more question:

>Put your aligned sequences into PHYLIP format

I didn't see any information on the phylip webpage about phylip format
and/or conversion tools.  I can do the conversion myself if I can find
documentation on the format.

>If you aren't certain of the differences between the different
>tree-building algorithms, you should get your hands on Nei and Kumar 2000
>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
>ISBN 0-87893-282-8 Chapter 11.

That's useful too.  Thanks again.


-- 

-=-=-=-=-=-=-=-=-=-=-=-=-=-
Gene Cutler
Bioinformatics Scientist
cutler at tularik.com
- - - - - - - - - - - - -
Tularik Inc
2 Corporate Drive
South San Francisco, CA 94080, USA
http://www.tularik.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-


From stein at fieldmuseum.org  Mon Aug 27 21:09:55 2001
From: stein at fieldmuseum.org (Jennifer Steinbachs)
Date: Mon, 27 Aug 2001 16:09:55 -0500 (CDT)
Subject: drawing trees
In-Reply-To: <a05101017b7b05a35d728@[192.168.50.41]>
Message-ID: <Pine.LNX.4.33.0108271602370.31408-100000@mail.fmnh.org>


I like to use Seaview to make the conversion (I don't have the website
handy but a google search should produce it quickly).  ClustalX (and maybe
clustalw) also produce Phylip files.  I thought the Phylip website had
information on the format, but it's been a while since I've actually
perused the documentation.  The phylip docs should definitely have
complete information.

If I recall correctly, it is something like:

#sequences #nucleotide_sites
sequence_name sequence
sequence_name sequence

etc.

There used to be a 10 character limit on sequence_name, but I don't know
if that holds with the latest version - I use PAUP* mostly for my
analyses.  Sequence can be non-interleaved or interleaved.
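
As a rough illustration of the layout sketched above, here is a small Python
function that writes the non-interleaved form. It assumes the classic
10-character name field (names are truncated or padded to that width) and
that all sequences are already aligned to the same length. The function name
is made up, and the PHYLIP documentation remains the authority on the format.

```python
def to_phylip(records, name_width=10):
    """Format (name, sequence) pairs as a non-interleaved PHYLIP alignment."""
    records = list(records)
    if not records:
        raise ValueError("no sequences given")
    n_sites = len(records[0][1])
    # Header line: number of sequences, then number of sites.
    lines = [f"{len(records)} {n_sites}"]
    for name, seq in records:
        if len(seq) != n_sites:
            raise ValueError(f"{name}: aligned sequences must have equal length")
        # Truncate or pad the name so the sequence starts in a fixed column.
        lines.append(name[:name_width].ljust(name_width) + seq)
    return "\n".join(lines) + "\n"


print(to_phylip([("human", "ACGT-A"), ("chimpanzee_x", "ACGTTA")]), end="")
```

Note that truncating long names can produce duplicates, so it is worth
checking that names are unique in their first ten characters.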

-jennifer

On Mon, 27 Aug 2001, Gene Cutler wrote:

 >Thanks Jennifer.  One more question:
 >
 >>Put your aligned sequences into PHYLIP format
 >
 >I didn't see any information on the phylip webpage about phylip format
 >and/or conversion tools.  I can do the conversion myself if I can find
 >documentation on the format.
 >
 >>If you aren't certain of the differences between the different
 >>tree-building algorithms, you should get your hands on Nei and Kumar 2000
 >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
 >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
 >>ISBN 0-87893-282-8 Chapter 11.
 >
 >That's useful too.  Thanks again.
 >
 >
 >

-- 
-----------------------------------
J. Steinbachs, Ph.D.
Computational Biologist
http://compbiology.org
-----------------------------------


From jrvalverde at cnb.uam.es  Tue Aug 28 06:05:15 2001
From: jrvalverde at cnb.uam.es (jrvalverde at cnb.uam.es)
Date: Tue, 28 Aug 2001 08:05:15 +0200 (DST)
Subject: drawing trees
In-Reply-To: <a05101017b7b05a35d728@[192.168.50.41]>
Message-ID: <200108280605.f7S65GE1348757@embnet.cnb.uam.es>

Gene Cutler <cutler at tularik.com> wrote:
> Thanks Jennifer.  One more question:
> 
> >Put your aligned sequences into PHYLIP format
> 
> I didn't see any information on the phylip webpage about phylip format
> and/or conversion tools.  I can do the conversion myself if I can find
> documentation on the format.

It's in the package documentation, but if you are already using CLUSTAL,
then simply go to "multiple alignments", choose "output options" and
then select "PHYLIP" format.

As for the details, look either at the documentation ("main.doc") or
at EMBnet's Quick Guide to PHYLIP (PDF and HTML versions may still be
found at
	http://www.es.embnet.org/~pprpc/activs/PHYLIPGuide/PhylipGuide-1.6.html).

				j


From frank at bioss.ac.uk  Tue Aug 28 07:45:33 2001
From: frank at bioss.ac.uk (Frank Wright)
Date: Tue, 28 Aug 2001 08:45:33 +0100
Subject: drawing trees
References: <a05101006b7b041c61d1e@[192.168.50.41]>
Message-ID: <3B8B4C1D.40315E3C@bioss.ac.uk>

Hi Gene,

>I didn't see any information on the phylip webpage about phylip format
>and/or conversion tools.  I can do the conversion myself if I can find
>documentation on the format.

PHYLIP FORMAT
-------------

PHYLIP format is discussed in the PHYLIP documentation "main" file:

   http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html

See section 6, "Overview of the input and output formats", subsection 1,
"Input File Format".
See the PHYLIP "sequences" documentation for details of how PHYLIP codes
unknowns and gaps:

   http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html 

READSEQ
-------
The READSEQ program is very useful for converting between alignment
formats (you may have to edit "." to be "-" though as PHYLIP codes gaps
differently). 

ftp://ftp.bio.indiana.edu/molbio/readseq/

To use:

  Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a

There is an old version and a Java version.  I prefer the old one :-).
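
If the converted file still has "." where PHYLIP expects "-", the edit
mentioned above can be scripted. The following Python sketch is one way to do
it. It assumes strict 10-column names and a layout where each named line
holds the start of one sequence (interleaved, or one line per sequence), so
that dots inside the names themselves are left alone. The function name is
made up for this example.

```python
def dots_to_dashes(phylip_text, name_width=10):
    """Replace "." with "-" in the sequence part of a PHYLIP alignment."""
    lines = phylip_text.splitlines()
    if not lines:
        return phylip_text
    # The first header field is the number of sequences, so that many
    # non-blank lines after the header start with a name field.
    named_left = int(lines[0].split()[0])
    out = [lines[0]]
    for line in lines[1:]:
        if named_left > 0 and line.strip():
            name, seq = line[:name_width], line[name_width:]
            out.append(name + seq.replace(".", "-"))
            named_left -= 1
        else:
            # Continuation blocks (and blank lines) carry no names.
            out.append(line.replace(".", "-"))
    return "\n".join(out) + "\n"
```

Typical use would be to read mygene.phy, pass its contents through
dots_to_dashes(), and write the result back out before running the PHYLIP
programs.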


Other programs
--------------

Jennifer Steinbachs has suggested some programs for reformatting
alignments.  Here are some additional comments:

There are other programs that can reformat alignment files.  CLUSTALW
can be used to read in alignments (option 1) and write them out (option
2, suboption 9) without doing an alignment.  I've not used SEQIO but it
looks useful:

http://bioweb.pasteur.fr/docs/seqio/seqio.html

On a PC you could use "export" and "import" facilities in GENEDOC, an
excellent alignment editor.

http://www.psc.edu/biomed/genedoc/


Best Wishes,
Frank
--
Frank Wright
Biomathematics and Statistics Scotland, 
SCRI, DUNDEE DD2 5DA, Scotland
frank at bioss.sari.ac.uk


From letondal at pasteur.fr  Tue Aug 28 07:51:50 2001
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 28 Aug 2001 09:51:50 +0200
Subject: drawing trees 
In-Reply-To: Your message of "Tue, 28 Aug 2001 08:45:33 BST."
             <3B8B4C1D.40315E3C@bioss.ac.uk> 
Message-ID: <200108280751.f7S7poM220418@electre.pasteur.fr>


Frank Wright wrote:
> Hi Gene,
> 
> >I didn't see any information on the phylip webpage about phylip format
> >and/or conversion tools.  I can do the conversion myself if I can find
> >documentation on the format.

Our Phylip Web server (http://bioweb/seqanal/phylogeny/phylip-uk.html) may help
with format conversion as well as with chaining phylogenetic programs.

> 
> PHYLIP FORMAT
> -------------
> 
> PHYLIP format is discussed in the PHYLIP documentation "main" file:
> 
>    http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html
> 
> See section 6, "Overview of the input and output formats", subsection 1,
> "Input File Format".
> See the PHYLIP "sequences" documentation for details of how PHYLIP codes
> unknowns and gaps:
> 
>    http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html 
> 
> READSEQ
> -------
> The READSEQ program is very useful for converting between alignment
> formats (you may have to edit "." to be "-" though as PHYLIP codes gaps
> differently). 
> 
> ftp://ftp.bio.indiana.edu/molbio/readseq/
> 
> To use:
> 
>   Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a
> 
> There is an old version and a Java version.  I prefer the old one :-).
> 
> 
> Other programs
> --------------
> 
> Jennifer Steinbachs has suggested some programs for reformatting
> alignments.  Here are some additional comments:
> 
> There are other programs that can reformat alignment files.  CLUSTALW
> can be used to read in alignments (option 1) and write them out (option
> 2, suboption 9) without doing an alignment.  I've not used SEQIO but it
> looks useful:
> 
> http://bioweb.pasteur.fr/docs/seqio/seqio.html
> 
> On a PC you could use "export" and "import" facilities in GENEDOC, an
> excellent alignment editor.
> 
> http://www.psc.edu/biomed/genedoc/
> 
> 
> 
> Best Wishes,
> Frank
> --
> Frank Wright
> Biomathematics and Statistics Scotland, 
> SCRI, DUNDEE DD2 5DA, Scotland
> frank at bioss.sari.ac.uk

--
Catherine Letondal -- Pasteur Institute Computing Center


From frank at bioss.ac.uk  Tue Aug 28 08:04:13 2001
From: frank at bioss.ac.uk (Frank Wright)
Date: Tue, 28 Aug 2001 09:04:13 +0100
Subject: drawing trees
References: <a05101006b7b041c61d1e@[192.168.50.41]>
Message-ID: <3B8B507D.B7639FE5@bioss.ac.uk>

Hi All,

Gene Cutler asked:

>Hello, all.  I have a question about phylogenetic-type trees for 
>sequences.  I haven't quite figured out how to do this using 
>emboss/phylip.  This is how I have been doing this with gcg:
>
>run gcg program distances on the msf file
>run gcg program growtree on the distances file
>
>How would I do this with PHYLIP instead?

The GCG DISTANCES program and GCG GROWTREE programs are very similar to
the DNADIST/PROTDIST and Neighbor programs in PHYLIP.  In other words,
they allow phylogenetic trees to be constructed using "distance-based"
methods, but do not allow maximum likelihood or parsimony methods to be
used.  They also don't do bootstrapping tests, tree comparisons, and
lots of other things.
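
To make "distance-based" concrete: the first step reduces the alignment to a
matrix of pairwise distances, and the tree is then built from that matrix
alone. The Python sketch below uses the simplest possible measure, the
uncorrected proportion of differing sites. That choice is only for
illustration; as far as I know DNADIST and PROTDIST apply substitution models
that correct for multiple changes at a site, so their numbers will differ.
The function names are made up.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing sites, skipping columns where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if gap not in (x, y)]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(records):
    """Return {name: {name: distance}} for a list of (name, sequence) pairs."""
    return {n1: {n2: p_distance(s1, s2) for n2, s2 in records}
            for n1, s1 in records}


matrix = distance_matrix([("a", "ACGT"), ("b", "ACGA"), ("c", "AC-T")])
print(matrix["a"]["b"])
```

A tree-building method such as Neighbor-Joining or weighted least squares
then works from the matrix rather than from the sequences themselves.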

If you are using distance-based phylogenetic methods, some notes:

(1) Weighted least-squares is slower but more accurate than
Neighbor-Joining, so use the PHYLIP FITCH program instead of NEIGHBOR.

(2) Recently, the PHYLIP-like WEIGHBOR program (Weighted
Neighbor-Joining) has been released.  WEIGHBOR appears to be an
improvement on Neighbor-Joining (and possibly weighted least squares). 
See http://www.t10.lanl.gov/billb/weighbor/.  I've not tried it out much
but the simulations in the published paper look convincing.

(3) PHYLIP (version 3.6) has improved DNADIST (more distance methods)
and PROTDIST (rate heterogeneity among sites added).

Best Wishes,
Frank
-- 
Frank Wright
Biomathematics and Statistics Scotland, 
SCRI, DUNDEE DD2 5DA, Scotland
frank at bioss.sari.ac.uk


From dmartin at bioinformatics.msiwtb.dundee.ac.uk  Tue Aug 28 08:26:00 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Tue, 28 Aug 2001 09:26:00 +0100 (BST)
Subject: drawing trees
In-Reply-To: <Pine.LNX.4.33.0108271602370.31408-100000@mail.fmnh.org>
Message-ID: <Pine.LNX.4.33.0108280923580.7400-100000@bioinformatics.msiwtb.dundee.ac.uk>


Remember that in the Phylip programs distributed as an EMBASSY package,
EMBOSS will do the sequence conversions for you.

The programs to run have the same names with an "e" prepended. The package
also defines the various options in ACD format, so the programs can be fully
scripted. The two programs that haven't been EMBOSSised are DRAWTREE and
DRAWGRAM.

..d

On Mon, 27 Aug 2001, Jennifer Steinbachs wrote:

>
> I like to use Seaview to make the conversion (I don't have the website
> handy but a google search should produce it quickly).  ClustalX (and maybe
> clustalw) also produce Phylip files.  I thought the Phylip website had
> information on the format, but it's been a while since I've actually
> perused the documentation.  The phylip docs should definitely have
> complete information.
>
> If I recall correctly, it is something like:
>
> #sequences #nucleotide_sites
> sequence_name sequence
> sequence_name sequence
>
> etc.
>
> There used to be a 10 character limit on sequence_name, but I don't know
> if that holds with the latest version - I use PAUP* mostly for my
> analyses.  Sequence can be non-interleaved or interleaved.
>
> -jennifer
>
> On Mon, 27 Aug 2001, Gene Cutler wrote:
>
>  >Thanks Jennifer.  One more question:
>  >
>  >>Put your aligned sequences into PHYLIP format
>  >
>  >I didn't see any information on the phylip webpage about phylip format
>  >and/or conversion tools.  I can do the conversion myself if I can find
>  >documentation on the format.
>  >
>  >>If you aren't certain of the differences between the different
>  >>tree-building algorithms, you should get your hands on Nei and Kumar 2000
>  >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7.  The other good
>  >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
>  >>ISBN 0-87893-282-8 Chapter 11.
>  >
>  >That's useful too.  Thanks again.
>  >
>  >
>  >
>
>

----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------


From pscotney at hotmail.com  Sat Aug 25 13:55:16 2001
From: pscotney at hotmail.com (Pierre Scotney)
Date: Sat, 25 Aug 2001 23:55:16 +1000
Subject: [EMBOSS] EMBOSS and Jemboss installation problems SOLVED!
Message-ID: <F9sn2Yf0scL9NtDEJmG0000037e@hotmail.com>


Hello!

I have solved the GNU/Linux EMBOSS and Jemboss installation problems :)

The solution was:

1) edit both /etc/profile (for bash) and /etc/csh.login (for csh) so that 
$PATH includes /usr/local/lib/j2sdk1.4.2/bin.  Previously only bash had 
the correct path to the Java binaries.

2) use j2sdk1.4.2, as EMBOSS-2.8.0 will not build with j2sdk1.3.1 (Blackdown 
Java-Linux).  Maybe the documentation/scripts will need to be changed to 
reflect this issue.

Cheers

Pierre

--
Dr Pierre Scotney
Melbourne
Australia
