From g38909015 at mailsrv.ym.edu.tw Wed Aug 1 01:06:37 2001
From: g38909015 at mailsrv.ym.edu.tw (TerryYeh-YM)
Date: Wed, 1 Aug 2001 13:06:37 +0800
Subject: Join mailing list
Message-ID: <000a01c11a47$be86f3c0$46146e8c@nchc.gov.tw>

------------------------------------------------------
Chang-Wei Yeh (Terry Yeh)
National Yang Ming University
College of Life Science
Institute of Anatomy and Cell Biology
Bioinformatics Program and Core Lab
------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.open-bio.org/pipermail/emboss/attachments/20010801/59162129/attachment.html

From simon.andrews at bbsrc.ac.uk Wed Aug 1 08:56:08 2001
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Wed, 1 Aug 2001 13:56:08 +0100
Subject: [EMBOSS] Getting headers from Seqret
Message-ID: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk>

[sent to Emboss mailing list]

Dear All,

I'm having trouble getting header information back through seqret, from a
database formatted using dbiflat against a genbank flat file (refseq
actually). I'm sure plenty of people must have done this before, but I've
read through the documentation, and I can't see where I'm going wrong!

The database formatted OK, and I can fetch sequences back from it, but at
some point I will need to retrieve the entire header from the original file
to get at some of the extra information in there (feature tables, cross
references, authors etc).

I've tried several different output USAs with seqret, but the most I can
seem to get back is the name, accession number and description. I can't
believe that this information is thrown away by seqret (it's still there in
the flat file after all), so how can I retrieve it?
Thanks for any help

Simon

[Potentially useful details follow]

----
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute
simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463

##########################################################################

Emboss version = 2.0.0
Platform = DEC alpha (OSF1 v4.0)

My emboss.default entry for the database looks like;

DB refseq [
   type: N
   method: emblcd
   format: gb
   dir: /usr/users/andrewss/Refseq/Genbank
   file: "*.gbff"
   release: "1.0"
   comment: "Refseq Hum Mus Rat"
]

and an example of the output of seqret with a debug USA is (with the
documentation space suspiciously blank!);

Sequence output trace
=====================
  Name: 'NM_031360'
  Accession: 'NM_031360'
  Description: 'Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.'
  Type: 'N'
  Database: 'refseq'
  Full name: ''
  Date: ''
  Usa: 'debug::test.seq'
  Ufo: ''
  Input format: 'gb'
  Output format: 'debug'
  Filename: 'test.seq'
  Entryname: 'NM_031360'
  File name: 'test.seq'
  Extension: 'fasta'
  Single: 'No'
  Features: 'No'
  Count: 'No'
  Documentation:...
       1 atgaagcaca acttttctct gcggctgagg gttttcaacc tcaactgctg      50
      51 ggacatcccc tacctaagca agcatagggc cgaccgcatg aagcgcttgg     100
etc.

The extra stuff I'm after is this sort of thing;

LOCUS       NM_031360    1269 bp    mRNA            ROD       12-JUN-2001
DEFINITION  Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.
ACCESSION   NM_031360
VERSION     NM_031360.1  GI:14389300
KEYWORDS    .
SOURCE      Norway rat.
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
            Rattus.
REFERENCE   1  (sites)
  AUTHORS   Mizutani,Y., Tamiya-Koizumi,K., Irie,F., Hirabayashi,Y., Miwa,M.
            and Yoshida,S.
  TITLE     Cloning and expression of rat neutral sphingomyelinase:
            enzymological characterization and identification of essential
            histidine residues
  JOURNAL   Biochim. Biophys. Acta 1485 (2-3), 236-246 (2000)
  MEDLINE   20292884
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review.
            The reference sequence was derived from AB047002.1.
FEATURES             Location/Qualifiers
     source          1..1269
                     /organism="Rattus norvegicus"
                     /strain="Sprague-Dawley"
                     /db_xref="taxon:10116"
                     /chromosome="X"
                     /chromosome="14"
                     /chromosome="2"
                     /chromosome="3"
                     /chromosome="17"
                     /map="Xq28"
                     /map="14q"
                     /map="2 36.0 cM"
                     /map="Xq11.1"
                     /map="3"
                     /map="17q12-q21"
                     /sex="male"
                     /tissue_type="liver"
                     /clone_lib="rat liver lambda cDNA library
                     (STRATAGENE,#936513)"
     gene            1..1269
                     /gene="Smpd2"
                     /note="EBS3; EBS4; K14; CK; MAGE5; MAGE10; Tdo; Araf"
                     /db_xref="LocusID:83537"
                     /db_xref="MGD:MGI:98246"
                     /db_xref="MIM:148066"
                     /db_xref="MIM:300340"
                     /db_xref="MIM:300343"
                     /db_xref="MIM:601443"
                     /db_xref="RATMAP:36372"
                     /db_xref="RGD:36372"
     CDS             1..1269
                     /gene="Smpd2"
                     /note="lyso-platelet activating factor-phospholipase C;
                     cytokeratin 14; Raf related protein;
                     Synaptosomal-associated protein"
                     /codon_start=1
                     /db_xref="LocusID:83537"
                     /db_xref="MGD:MGI:98246"
                     /db_xref="MIM:148066"
                     /db_xref="MIM:300340"
                     /db_xref="MIM:300343"
                     /db_xref="MIM:601443"
                     /db_xref="RATMAP:36372"
                     /db_xref="RGD:36372"
                     /product="neutral sphingomyelinase"
                     /protein_id="NP_112650.1"
                     /db_xref="GI:14389301"
                     /translation="MKHNFSLRLRVFNLNCWDIPYLSKHRADRMKRLGDFLNLESFDL
                     ALLEEVWSEQDFQYLKQKLSLTYPDAHYFRSGIIGSGLCVFSRHPIQEIVQHVYTLNG
                     YPYKFYHGDWFCGKAVGLLVLHLSGLVLNAYVTHLHAEYSRQKDIYFAHRVAQAWELA
                     QFIHHTSKKANVVLLCGDLNMHPKDLGCCLLKEWTGLRDAFVETEDFKGSEDGCTMVP
                     KNCYVSQQDLGPFPFGVRIDYVLYKAVSGFHICCKTLKTTTGCDPHNGTPFSDHEALM
                     ATLCVKHSPPQEDPCSAHGSAERSALISALREARTELGRGIAQARWWAALFGYVMILG
                     LSLLVLLCVLAAGEEAREVAIMLWTPSVGLVLGAGAVYLFHKQEAKSLCRAQAEIQHV
                     LTRTTETQDLGSEPHPTHCRQQEADRAEEK"
     misc_feature    91..837
                     /note="AP_endonucleas1; Region: AP endonuclease family 1"

From peter.rice at uk.lionbioscience.com Wed Aug 1 09:12:57 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Wed, 01 Aug 2001 14:12:57 +0100
Subject: [EMBOSS] Getting headers from Seqret
References: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk>
Message-ID: <3B680059.74B4594@uk.lionbioscience.com>

"simon
andrews (BI)" wrote:
> The database formatted OK, and I can fetch sequences back from it, but
> at some point I will need to retrieve the entire header from the
> original file to get at some of the extra information in there
> (feature tables, cross references, authors etc).
>
> I've tried several different output USAs with
> seqret, but the most I can seem to get back is the name, accession number
> and description.

It all depends on how much information we store in the internal data
structures.

As standard, we keep the ID, Accession, Description and sequence so we can
write a FASTA format file easily.

We also keep the complete feature table, but only optionally. seqret ignores
it, but seqretallfeat reads and writes it. Most programs only need the
sequence data and parsing feature information wastes time and space on large
sequences.

We can also read the entire text of an entry with entret, assuming you want
the original flatfile format.

> I can't believe that this information is thrown away by seqret
> (it's still there in the flat file after all),

Yes, it is (but we can easily read more fields - the problem is whether we
can convert them to other file formats easily)

> so how can I retrieve it?

Using entret - which sounds like the solution you need.

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From ableasby at hgmp.mrc.ac.uk Wed Aug 1 13:15:25 2001
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Wed, 1 Aug 2001 18:15:25 +0100 (BST)
Subject: EMBOSS patchfiles directory
Message-ID: <200108011715.SAA26106@bromine.hgmp.mrc.ac.uk>

Just a reminder that, between EMBOSS releases, occasional bugfixes are
placed in the directory:

ftp://ftp.uk.embnet.org/pub/EMBOSS/patchfiles/

There are currently two replacement files in that directory.

marscan.c
showfeat.c

Both are replacements for applications in the EMBOSS-2.0.1/emboss directory.

Alan

From gbottu at ben.vub.ac.be Thu Aug 2 13:00:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 2 Aug 2001 19:00:02 +0200 (MET DST)
Subject: databanks in PIR format
Message-ID: <200108021700.TAA24275@bigben.vub.ac.be>

from : BEN

Dear colleagues,

Has anybody already successfully accessed databanks in PIR NBRF or CODATA
format under EMBOSS ?

I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format
is indexed under SRS). My emboss.default file contains :

DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant'
   methodquery: srs dbalias: PIR_NR
   methodall: direct dir: /seq/protein/flat file: pir_nr.seq
]

But this does not work. E.g. seqret pir_nr:e69549 gives an output file :

>E69549 conserved hypothetical protein AF2396 - Archaeoglobus fulgidus
>E69549
MTVVPLSALREGQEGRVVAINGGRGCTARLMSMGIVPGKKIRIAGRRGGAVLVSVNGTKF
VIGRGLAMKVAVDVGEQG

Guy Bottu

From peter.rice at uk.lionbioscience.com Thu Aug 2 13:28:29 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Thu, 02 Aug 2001 18:28:29 +0100
Subject: databanks in PIR format
References: <200108021700.TAA24275@bigben.vub.ac.be>
Message-ID: <3B698DBD.9984ED23@uk.lionbioscience.com>

Guy Bottu wrote:
> I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is
> indexed under SRS). My emboss.default file contains :
>
> DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant'
>    methodquery: srs dbalias: PIR_NR
>    methodall: direct dir: /seq/protein/flat file: pir_nr.seq
> ]
>
> But this does not work. E.g. seqret pir_nr:e69549 gives an output file

This is because of problems in SRS converting PIR entries to PIR format.
This has been the same since the days of SRS 5, but I have passed it on to
the support guys here to take a look.

Seems nobody has been retrieving PIR entries in their original format.
For example, see PIR on the SRS 5 server at MIPS:

http://srs-mips.gsf.de/srs5bin/cgi-bin/wgetz?-id+2trYB1GreRI+-e+[PIR-ID:'E69549']

You can get queries to work with:

DB pir_nr [ type: P format: fasta comment: 'PIR nonredundant'
   methodquery: srsfasta dbalias: PIR_NR
   methodall: direct dir: /seq/protein/flat file: pir_nr.seq
]

... but the fasta format required for srsfasta will not let you work with
direct access to all entries.

srs access does       getz -e
srsfasta access does  getz -d -sf fasta

regards,

Peter

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From peter.rice at uk.lionbioscience.com Thu Aug 2 13:59:20 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Thu, 02 Aug 2001 18:59:20 +0100
Subject: databanks in PIR format
References: <200108021700.TAA24275@bigben.vub.ac.be> <3B698DBD.9984ED23@uk.lionbioscience.com>
Message-ID: <3B6994F8.F4A5A403@uk.lionbioscience.com>

>This is because of problems in SRS converting PIR entries to PIR format.
>This has been the same since the days of SRS 5, but I have passed it on to
>the support guys here to take a look.

Quick fix would be to change the format in pir.i to be "plain" and run
srssection.

This gives PIR format without the trailing * but is good enough to make
EMBOSS happy. Then Guy's original definition should work.

regards,

Peter

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From gbottu at ben.vub.ac.be Fri Aug 3 05:43:57 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 11:43:57 +0200 (MET DST)
Subject: databanks in PIR format
Message-ID: <200108030943.LAA11981@bigben.vub.ac.be>

>Quick fix would be to change the format in pir.i to be "plain" and run
>srssection.
>
>This gives PIR format without the trailing * but is good enough to make
>EMBOSS happy.
Then Guy's original definition should work.
>

I tried and it worked ! Thanks for the advice.

Still, there must be some nasty bug hidden in the SRS code, since a similar
problem does not occur with EMBL and GenBank formats. Let's hope they can
fix it.

Guy Bottu

From gbottu at ben.vub.ac.be Fri Aug 3 08:48:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 14:48:02 +0200 (MET DST)
Subject: problem with remote databank access
Message-ID: <200108031248.OAA26689@bigben.vub.ac.be>

from : BEN

Dear support,

While experimenting with remote databank access I noticed the following :

DB GENBANK [ type: N format: genbank method: url
   comment: 'GenBank at Institut Pasteur (Paris, France)'
   url: "http://srs.pasteur.fr/cgi-bin/srs6/wgetz?-e+[genbank-acc:%s]"
]

does work fine. However, with :

DB GENBANK [ type: N format: genbank method: url
   comment: 'GenBank at DKFZ (Heidelberg, Germany)'
   url:"http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+[genbank-acc:%s]"
]

seqret genbank:X15320 retrieves a file :

>ECARGS X15320 Escherichia coli argS gene for arginyl-tRNA-synthetase (EC 6.1.1.19

The problem is probably that at the DKFZ they index the databank in GCG
format. However, replacing "format: genbank" by "format: gcg" does not work.

Guy Bottu

From jackl at dalicon.com Fri Aug 3 09:11:41 2001
From: jackl at dalicon.com (Jack Leunissen)
Date: Fri, 3 Aug 2001 15:11:41 +0200
Subject: problem with remote databank access
References: <200108031248.OAA26689@bigben.vub.ac.be>
Message-ID: <009001c11c1d$d74aaff0$0400a8c0@cmbipc32>

No, the problem is that their default output format is EMBL! And that seems
to upset EMBOSS, as it expects GENBANK format for the sequence information
too. Changing the call to:

url:"http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+-sf+genbank+[genbank-acc:%s]"

does the trick! (Note the addition of "+-sf+genbank" to force the sequence
output in GENBANK format.)

Cheers,
Jack

Jack A.M.
Leunissen
Centre for Molecular and Biomolecular Informatics
Nijmegen, Netherlands
Email: jackl at cmbi.kun.nl
Tel  : +31 24 365 22 48
Fax  : +31 24 365 29 77
http://www.cmbi.kun.nl/

----- Original Message -----
From: "Guy Bottu"
To:
Cc: ;
Sent: Friday, August 03, 2001 2:48 PM
Subject: problem with remote databank access

> from : BEN
>
> Dear support,
>
> While experimenting with remote databank access I noticed the following :
>
> DB GENBANK [ type: N format: genbank method: url
>    comment: 'GenBank at Institut Pasteur (Paris, France)'
>    url: "http://srs.pasteur.fr/cgi-bin/srs6/wgetz?-e+[genbank-acc:%s]"
> ]
>
> does work fine. However, with :
>
> DB GENBANK [ type: N format: genbank method: url
>    comment: 'GenBank at DKFZ (Heidelberg, Germany)'
>    url:"http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+[genbank-acc:%s]"
> ]
>
> seqret genbank:X15320 retrieves a file :
>
> >ECARGS X15320 Escherichia coli argS gene for arginyl-tRNA-synthetase (EC 6.1.1.19
>
> The problem is probably that at the DKFZ they index the databank in GCG format.
> However, replacing "format: genbank" by "format: gcg" does not work.
>
> Guy Bottu
>
>

From peter.rice at uk.lionbioscience.com Fri Aug 3 11:07:39 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Fri, 03 Aug 2001 16:07:39 +0100
Subject: databanks in PIR format
References: <200108030943.LAA11981@bigben.vub.ac.be>
Message-ID: <3B6ABE3B.5EBE5C2C@uk.lionbioscience.com>

Guy Bottu wrote:
> >Quick fix would be to change the format in pir.i to be "plain" and run
> >srssection.
>
> I tried and it worked ! Thanks for the advice.
>
> Still, there must be some nasty bug hidden in the SRS code, since similar
> problem does not occur with EMBL and GenBank formats. Let's hope they
> can fix it.

"It's not a bug, it's a feature"

As it has been there since SRS 5.0 (at least) and requires changes to the C
source code (so that PIR format behaves the same way as EMBL) it will have
to wait for a future release.
Meanwhile, the plain fix will work well enough - some software may want a
trailing '*' but probably most programs will be happy.

Peter

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From dalke at dalkescientific.com Sun Aug 5 20:52:59 2001
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 6 Aug 2001 01:52:59 +0100
Subject: questions about ACD format
Message-ID: <005101c11e12$470a8180$0201a8c0@josiah.dalkescientific.com>

[Brief summary: I'm trying to integrate Emboss with Biopython and found
that 1) not enough sequence type information is available in the ACD file
for Biopython's AlphabetStrict code to work, so I have a proposal to fix
that, 2) I have questions about how to interpret some of the documentation,
3) there are places where the Emboss ACD parser doesn't appear to work
correctly, and 4) general observations on the ACD format and on the
implementation.]

Hello,

First off, my apologies if this is the wrong email address for this topic.
I couldn't find any archives to scan for verification. I am also not a
member of this list, so please cc me on any replies.

Based on the feedback I got from some people at ISMB, I've started a Python
interface to EMBOSS. The goal is to be able to do something like:

>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Emboss import apps
>>>
>>> seq = Seq.Seq("AATCCATCGATGCAC", IUPAC.unambiguous_dna)
>>> results = apps.revseq(sequence = seq)
>>> results["outseq"]
Emboss.EmbossSeq("GTGCATCGATGGATT", IUPAC.ambiguous_dna)
>>>

I can almost, but not quite, do this, for some reasons I'll describe
shortly. Here are the questions and problems I had in doing this, as well as
some specific features I would like to see added, which I feel may make it
easier to integrate EMBOSS with other systems.

======

** Topic 1

As you can see in the above example, there is some automatic conversion
going on.
One is to convert the Biopython 'Seq' object to a temporary file, so it can
be used with the '-sequence' parameter needed by revseq. This is done by
knowing how to convert the Seq object to a 'seqall' Emboss type, including
looking at the 'type' field to ensure that the input sequence is really DNA.

The conversion step requires that I verify the Biopython Seq Alphabet
against the Emboss sequence 'type'. There is a description of the types in
the syntax document, at
  http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html
but it doesn't describe:

 1) what is used as a gap character? (I assume '-')
 2) what is used for a stop character? (I assume '*')
 3) are selenocysteines encoded with a U? (the pureprotein definition says
    it excludes "BZ or X", so I'm guessing selenocysteines aren't allowed -
    or are they encoded as X?)
 4) shouldn't there be a gapstopprotein?

** Topic 2

Another conversion is to create a temporary filename for the -outseq
parameter, based on the 'seqoutall' Emboss type. I would like to read the
contents of this file into a Biopython Seq object, however, the ACD
description does not contain enough information for me to do that. Instead,
I can only create the tempfile and store the filename in the "outseq"
parameter.

Could a new 'type' parameter be added to 'seqoutall'? This would change
revseq's "outseq" definition to be

  seqoutall: outseq [
    parameter: "Y"
    type: "dna"
    extension: "rev"
  ]

For applications like 'notseq' this would require using an operation:

  seqoutall: outseq [
    param: Y
    type: "@($(sequence.protein)? protein : nucleic)"
  ]

The goal of this is to let researchers use EMBOSS from Python without
having to worry about an implementation detail - the existence of a file
system.

(BTW, I have heard there may be support in 3.0 for XML output and the
ability for all the output to be streamed to stdout. I didn't find any
details about this on my scan of the web pages - what is the status and
plans for this?)
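
The temporary-file handling described in Topics 1 and 2 might look roughly
like the following. This is only a minimal sketch under stated assumptions,
not the Biopython code the message refers to: the helper names
(write_fasta, prepare_revseq_call) and the file names input.fasta and
output.rev are invented for illustration, and only the -sequence and -outseq
qualifiers mentioned above are used. It builds the command line but does not
run revseq.

```python
import os
import tempfile


def write_fasta(path, name, residues, width=60):
    """Write a single sequence to `path` in FASTA format."""
    with open(path, "w") as handle:
        handle.write(">%s\n" % name)
        for start in range(0, len(residues), width):
            handle.write(residues[start:start + width] + "\n")


def prepare_revseq_call(residues, workdir):
    """Return (argv, outseq_path) for a revseq run on `residues`.

    Hypothetical helper: the input sequence is written to a temporary
    FASTA file and an output file name is reserved in the same directory.
    """
    in_path = os.path.join(workdir, "input.fasta")
    out_path = os.path.join(workdir, "output.rev")
    write_fasta(in_path, "query", residues)
    argv = ["revseq", "-sequence", in_path, "-outseq", out_path]
    return argv, out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        argv, out_path = prepare_revseq_call("AATCCATCGATGCAC", workdir)
        # A caller would run argv (for example with subprocess) and then
        # read out_path back in; without a declared output 'type' it
        # cannot know which alphabet to attach to the result.
        print(" ".join(argv))
```

The sketch stops at the point where the message says information is missing:
the output file can be created and located, but the ACD description gives no
sequence type to use when reading it back.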

** Topic 3:

The Emboss sequence data type contains the calculated attributes of
'protein', 'nucleic' and 'type'. Is it that:

 - protein is true when the sequence type is 'protein', 'gapprotein',
   'pureprotein' or 'stopprotein'
 - nucleic is true when the sequence type is 'dna', 'rna', 'puredna',
   'purerna', 'nucleotide', 'purenucleotide', 'gapnucleotide', 'gapdna',
   'gaprna',
 - protein and nucleic are false for any other case
?

** Topic 4:

How do I force a sequence type? The -sprotein and -snucleotide command-line
qualifiers are only boolean values, so there doesn't seem to be any way to
say an input is really a pureprotein. Eg, there could be a '-stype'
qualifier, so I can do '-stype pureprotein'.

** Topic 5:

Given the existence of sequence.type, shouldn't most operations of the form
  "@($(sequence.protein)? protein : nucleic)"
really be
  "$(sequence.type)"
? This should allow better propagation of proper type information through
Emboss.

==

Okay, that's the sequence type related topic. Now for some others, first on
parsing ACD files.

To get the parameter information I read the ACD files. There are actually
two possible files to read: the ".acd" file and the file produced from the
"-acdpretty" option.

** Topic 6:

Which is the preferred mechanism for getting ACD configuration information?
There are advantages and disadvantages to either one.

 - The .acd file does not require executing a possibly arbitrary program to
   get its parameter information. This can be a subtle security problem
   because the mechanism I'm using just does a system() call to see if the
   program exists, and has no qualms in running "rm -rf / && echo", which
   expands to the valid command "rm -rf / && echo -acdpretty". By checking
   the acd file first, it eliminates that possibility, although it does
   require that the directory containing the .acd definitions be well-known.
   Is this well-known directory $EMBOSS_ACDROOT or is that a 1.x location?
   (The other possibility is to require that all Emboss executables and only
   Emboss executables be in a well-defined directory. Looking at the
   standard 'configure', this is not usually the case - they get put into
   /usr/local/bin )

 - a problem with using the .acd file is that it may be out of synch with
   the actual executable

 - the -acdpretty option is problematical in that it writes its information
   to a file in the local directory. My Python code cannot guarantee that
   the local directory is writeable, so I need to mkdir a temp directory
   then "cd $(tmpdir) && $(program) -acdpretty" then read
   "$(tmpdir)/$(program).acdpretty" then remove the directory. It would be
   so much easier if the -acdpretty option could write to stdout. (Eg, as
   when used as '-acdpretty -stdout')

 - the .acd file may use abbreviated names. For example, it may have a
   qualifier as "param" instead of "parameter". So the -acdpretty text is
   easier to parse.

I would prefer getting the ACD data directly from the executable. Is it
possible to allow -stdout as an option to -acdpretty to make it dump to
stdout? The other issues I can work around.

** Topic 7:

The ACD syntax definition is incomplete. Here are some problems I ran
across.

> Comments start with "#" and continue to the end of the line.

Must the '#' be in the first character position? The function
ajacd.c:acdNoComment looks like it truncates the line at the first '#', no
matter where it is in the string, so the '#' doesn't need to be the first
character.

On the other hand, it looks like that bit of code doesn't understand quoted
strings. Consider

% cat foo.acd
appl: foo [
  doc: "Who is #1?"
  groups: "Edit"
]
% ../acdc foo
Who is groups: "Edit
%

> Each line is parsed into tokens delimited by spaces

What is the definition of a token? We also have that

> Parameters and qualifiers are defined by a single token followed by
> either a colon ':' (preferred) [1] or an equal sign '=' which in
> turn is followed by a second token.

This means a token cannot end in a ':' or a '='. But it can contain a ':'
outside of quotes, as in

  opt: @($(showall)?N:Y)

Or consider

% cat foo.acd
appl: foo [
  doc: A:
]
% ../acdc foo
A:
%

This means the ':' is not part of the first token in a parameter/qualifier
but is part of the second token.

Spaces aren't really the token delimiter. The file 'wordcount.acd' contains

  sequence: sequence [ param: Y type: dna]

so the token 'dna' is not space delimited before the ']'. Also,
checktrans.acd uses 'min:1' which is not space delimited.

I'm trying to figure out how ajacd.c does it, but I'm getting lost in the
code. To make things even more confusing

% cat foo.acd
appl: f"oo [
  doc:A]B
]
% ../acdc foo
A]B
%

Also, the term 'space' in the documentation should be 'whitespace' since it
can skip '\t' characters.

Hmm, and looking at the code, there are problems with how it skips the ':'
characters.

% cat foo.acd
appl:::: foo [
  doc: "This is the doc."
]
% ../acdc foo
This is the doc.
%

And using a NUL character

% od -c foo.acd
0000000 a p p l : f o o [ \n
0000020 d o c \0 : " H a s a
0000040 N U L c h a r a c t e r " \n
0000060 S t r a n g e \n
0000100 ] \n
0000102
% ../acdc foo
Strange
%

So the parser code does not fully validate that the input data is in the
correct format.

> After the name, definitions are in mandatory square brackets, [],
> which can make a definition span multiple lines.

seqretallfeat.acd contains the following two lines

  endsection: secoutseq
  endsection: secinseq

which don't have the []. My parser ends up special casing the 'endsection'
declarations. Would it be possible to use, say,

  endsection: secoutseq []

instead? (Also, section and endsection are not defined anywhere in that
syntax document.)

> Tokens representing data types can be abbreviated up to the point
> where they are not ambiguous

That's a VMS-help-style shortcut. As I recall, that has a
forward-compatibility problem.
For example, if a new data type called 'apple' is added, then 'a', 'ap',
'app', and 'appl' are no longer unambiguous. Has there been any
consideration on how to deal with that?

> Values can be delimited (i.e. treated as one token) by any of the
> following pairs, which are stripped as the value is parsed :
>
> '' {} () [] <>

It's not clear what a "value" means. In this section there is

  token: token [ definition ]

But later on the word 'attributes' is used instead of 'definition':

  data_type: parameter_name [ attributes ]

and only then does it say what a value is:

> A defining attribute must have a second token representing the value
> of the attribute.

So perhaps there should be some cleanup of the definition. (The reason I
needed to figure this out was to check that

  appl: foo [ "multiword attribute": N, ]

was indeed supposed to be illegal.)

There doesn't appear to be any way to escape a quote character inside of a
quoted token. At least, not that I could see in the code. So there's no way
to write something like

  appl: foo [ doc: "Remove the characters ""{}<>()'" ]

for the string

  Remove the characters "{}<>()'

Also, the doc says the valid characters are
  '' {} () [] <>
but that should include "double quotes"

And just why are there so many quote characters?

** Topic 8:

It took me a while to figure out that ajacd.c did the ACD parsing. The file
ajnam.c parses the .embossrc and emboss.default files, whose format is
described in
  http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Usa/databases.html
and is *almost* ACD format. The difference is that it doesn't have an
'application' term and the 'DB' needs to be 'DB:'.

I can tweak my parser to handle the 'DB' term, but why can't those two files
really be in ACD format?

... Although implementation-wise the ajacd.c file uses static variables so
it can only be used to parse one file.

I noticed a couple of problems with how ajnam.c works.

 - It only understands a comment as a '#' in the first character position
   (while ajacd.c recognizes it anywhere).

 - The code uses "fgets(line, 512, file)" which looks like it can fail if
   the line is more than 511 characters, as with long file names.

(Actually, since this is a completely different implementation of the
parser, the failure conditions are different. For example, there is a
namNoColon in ajnam.c, but nothing to strip a '='.)

** Topic 9

There needs to be some clarification on the license. When I looked at the
code I read the top-level "COPYING" file, which is the GPL. I have a policy
not to look at GPL'ed code too closely, since I worry that it may
contaminate my ability to write equivalent non-GPL code, like the BSDed
Biopython code. LGPL is not quite as bad, but even then I write the
non-FSF-licensed code first, then if needed for verification I look at the
LGPL'ed code.

Since the top-level COPYING file is the GPL, that put me off looking at any
of the source code, even for verifying format requirements. It had to be
pointed out to me that the ajax and nucleus codes are covered under the
LGPL. I would not have discovered that on my own, because the multiple
license use wasn't mentioned in the README.

In addition, I noticed the ./LICENSE is slightly different than the current
Version 2, June 1991 one from the FSF. The FSF address is wrong, and there
are formatting changes. I cannot tell if there are any text changes.

I also noticed the ./COPYING file is the GPL, except for a change in the
address and the exclusion of the section "How to Apply These Terms to Your
New Programs".

Shouldn't these be identical files, and match the current FSF GPL?

** Topic 10

What does the 'warnrange' attribute of an integer do?

(I've only lightly scanned the table of data types so will likely have more
questions about the other fields in the future.)

** Topic 11

In scanning the code I noticed there is an indirection layer, which I assume
is to isolate the programmer from changes in the OS and C library. It isn't
used everywhere. For example, there's an ajNamGetenv but several places call
getenv directly.

I also did a scan looking for possible overflows and other security
problems. Because of my inexperience with the indirection layer I couldn't
do an in-depth check, but I did notice that ajStrFromFloat and
ajStrFromDouble can fail on Inf, -Inf and NaN, for a couple of reasons:

% cat inf.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void) {
  /* float val = -1.0/0.0; */
  float val = strtod("-inf", NULL);
  char s[100];
  int precision = 0, ival, i;

  sprintf(s, "val == >>%.0f<<", val);
  puts(s);
  /* converting an infinite float to int is not well defined */
  ival = abs((int) val);
  printf("ival = %d\n", ival);
  if (ival)
    i = precision + (int) log10((double)ival) + 4;
  else
    i = precision + 4;
  printf("i == %d\n", i);
  return 0;
}
% cc inf.c -lm
% ./a.out
val == >>-inf<<
ival = -2147483648
i == -2147483644
%

** Topic 12

Here's my first pass of the BNF for the ACD file. There are various things
to fix, some of which are noted. This can be used for every file in the
emboss/acd directory except qatest.acd (which contains a syntax error that
acdc doesn't catch -- the "int bint" field) and testplot.acd (contains an
'=' instead of a ':', which I don't yet handle).

Lexer:
  colon = ":"
  open_block = "\["
  close_block = "\]"
  endsection = "endsection"
  key = "(?!endsection)[a-zA-Z0-9_]+(?=[\s:\][])"
  value = "[a-zA-Z0-9_]+(?![\s:\][])[^\s\]]* | [^\000-\037a-zA-Z0-9_:[\]\s][^\s\]]*"
  quoted = '"[^"]*"'    (only handles double quotes - need to fix)
  comment = "[#][^\n\r]*(\r|\r?\n)"    SKIPPED
  whitespace = "\s+"    SKIPPED

Parser: (need to update the names to match the syntax doc)
  application ::= widget_list
  widget_list ::= widget | widget widget_list
  widget ::= key colon key open_block arglist close_block |
             key colon key key |
             key colon key value |
             endsection colon key
  arglist ::= arg | arg arglist
  arg ::= key colon key | key colon value

** Topic 13

One last thing. The parameter information for the different ACD data types
is hard coded in ajacd.c. If it was stored in an external data file (in ACD
format with well-defined fields :) then my Python code could read that
meta-information to build up its tables, rather than me having to code it
all by hand.

Hope this wasn't too much at once :)

Andrew
dalke at dalkescientific.com

From 962856211 at tay.ac.uk Tue Aug 7 11:17:47 2001
From: 962856211 at tay.ac.uk (962856211 at tay.ac.uk)
Date: Tue, 7 Aug 2001 16:17:47 +0100
Subject: free downloads?
Message-ID: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk>

The list of programs you have at
http://www.uk.embnet.org/Software/EMBOSS/Apps/

is it a list of freedownloads?
Barry Marshall BSC Hons
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.open-bio.org/pipermail/emboss/attachments/20010807/334a6ae7/attachment.html

From gwilliam at hgmp.mrc.ac.uk Tue Aug 7 11:27:39 2001
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Tue, 07 Aug 2001 16:27:39 +0100
Subject: free downloads?
References: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk> Message-ID: <3B7008EB.F4437A4F@hgmp.mrc.ac.uk> > 962856211 at tay.ac.uk wrote: > > The list of programs you have at > http://www.uk.embnet.org/Software/EMBOSS/Apps/ > > is it a list of freedownloads? > Barry Marshall BSC Hons This is a list of the applications in the EMBOSS package. The package can be downloaded for free (under the GPL licence) See: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html -- Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512 mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/ Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK From dmartin at bioinformatics.msiwtb.dundee.ac.uk Tue Aug 7 12:22:31 2001 From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin) Date: Tue, 7 Aug 2001 17:22:31 +0100 (BST) Subject: free downloads? In-Reply-To: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk> Message-ID: On Tue, 7 Aug 2001 962856211 at tay.ac.uk wrote: > The list of programs you have at > http://www.uk.embnet.org/Software/EMBOSS/Apps/ > > is it a list of freedownloads? EMBOSS is a freely downloadable package licensed under the GPL/LGPL. You will probably want a unix/linux system on which to install it. The admin guide describes in excruciating detail how to do this (look in the documentation section of the web site). If you are in Dundee (at least your email address is) then drop by if you have any questions. ..d ---------------------------------- David Martin PhD Bioinformatics Scientific Officer Wellcome Trust Biocentre, Dundee ---------------------------------- From cbonnard at isrec-sg1.unil.ch Mon Aug 20 05:09:11 2001 From: cbonnard at isrec-sg1.unil.ch (Claude Bonnard) Date: Mon, 20 Aug 2001 11:09:11 +0200 Subject: Database access for EMBOSS Message-ID: <10108201109.ZM13075@isrec-sg1> Hello, It is not very surprising that SRS is the best mode for a fast access to the sequence databases from EMBOSS. 
As I understood, the URL mode allows the access to a SINGLE sequence and would not support the "USA" standard (wild card query) as SRS mode does. If it is the case, is there a solution when the SRS server is NOT on the same machine, but on a machine which is dedicated to SRS? I have in mind a rsh type of request and I would like to know if someone experience this type of problem and could help me in solving that. Thanks a lot Regards Claude -- Claude Bonnard Ph.D. ISREC (Swiss Institute for Experimental Cancer Research) Bioinformatics Group Ch des Boveresses 155 CH-1066 Epalinges Switzerland phone: [41-21]-692-5891/-2236 fax: [41-21]-652-6933 From peter.rice at uk.lionbioscience.com Mon Aug 20 05:20:55 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Mon, 20 Aug 2001 10:20:55 +0100 Subject: Database access for EMBOSS References: <10108201109.ZM13075@isrec-sg1> Message-ID: <3B80D677.99A6DCC8@uk.lionbioscience.com> Claude Bonnard wrote: > It is not very surprising that SRS is the best mode for a fast access > to the sequence databases from EMBOSS. As I understood, the URL mode > allows the access to a SINGLE sequence and would not support the > "USA" standard (wild card query) as SRS mode does. True. We could add an "SRSREMOTE" access mode to extend queries, easy to program but maybe limited practical use. > If it is the case, is there a solution when the SRS server is NOT > on the same machine, but on a machine which is dedicated to SRS? > I have in mind a rsh type of request and I would like to know if > someone experience this type of problem and could help me in > solving that. SRS access mode allows you to define the name of the getz program. How about an alternative name that is a script, and uses rsh to run a remote getz and returns the results? 
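A wrapper of that kind could look roughly like the Python sketch below. It is only an illustration: the host name srs.example.org and the assumption that getz is on the remote PATH are placeholders, not details from this thread. The arguments are forwarded unchanged, since EMBOSS builds the SRS query itself, and they are shell-quoted because rsh hands the command to a remote shell and SRS queries contain square brackets.

```python
import shlex
import subprocess

REMOTE_HOST = "srs.example.org"  # hypothetical machine dedicated to SRS
REMOTE_GETZ = "getz"             # assumed to be on the remote PATH


def build_command(args):
    """Return the rsh command line that runs getz remotely with args."""
    # Quote each argument so brackets and quotes in the SRS query
    # survive the remote shell untouched.
    remote = " ".join(shlex.quote(a) for a in [REMOTE_GETZ, *args])
    return ["rsh", REMOTE_HOST, remote]


def run(args):
    """Run the remote getz; its output goes straight to our stdout."""
    return subprocess.call(build_command(args))
```

A real script would finish with sys.exit(run(sys.argv[1:])) and be made executable.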
For example, if your script is called 'remotegetz' just add this to the database definition: app: remotegetz (you can use the full path if needed) Note: This was originally added because the Sanger Centre ran 2 versions of SRS (5.1 and 6.0) and I needed to switch between them, but it has other possible uses. -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From dmartin at bioinformatics.msiwtb.dundee.ac.uk Mon Aug 20 05:34:19 2001 From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin) Date: Mon, 20 Aug 2001 10:34:19 +0100 (BST) Subject: Database access for EMBOSS In-Reply-To: <3B80D677.99A6DCC8@uk.lionbioscience.com> Message-ID: On Mon, 20 Aug 2001, Peter Rice wrote: > Claude Bonnard wrote: > > It is not very surprising that SRS is the best mode for a fast access > > to the sequence databases from EMBOSS. As I understood, the URL mode > > allows the access to a SINGLE sequence and would not support the > > "USA" standard (wild card query) as SRS mode does. > > True. We could add an "SRSREMOTE" access mode to extend queries, easy to > program but maybe limited practical use. > > > If it is the case, is there a solution when the SRS server is NOT > > on the same machine, but on a machine which is dedicated to SRS? > > I have in mind a rsh type of request and I would like to know if > > someone experience this type of problem and could help me in > > solving that. > > SRS access mode allows you to define the name of the getz program. > > How about an alternative name that is a script, and uses rsh to run a > remote getz and returns the results? 
> > For example, if your script is called 'remotegetz' just add this to the > database definition: > > app: remotegetz > > (you can use the full path if needed) > > Note: This was originally added because the Sanger Centre ran 2 versions of > SRS (5.1 and 6.0) and I needed to switch between them, but it has other > possible uses. This would then allow one to add whichever script one wanted as long as it could parse srs style arguements.. It doesn't have to be SRS, just look like it.. The potential is there for wrapping in house rdbms with such a script. I'll add some more comments to the admin guide if Peter can send me details of how EMBOSS calls the wgetz program (not being much of an srs hacker myself). ..d ---------------------------------- David Martin PhD Bioinformatics Scientific Officer Wellcome Trust Biocentre, Dundee ---------------------------------- From peter.rice at uk.lionbioscience.com Mon Aug 20 05:50:22 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Mon, 20 Aug 2001 10:50:22 +0100 Subject: Database access for EMBOSS References: Message-ID: <3B80DD5E.F61846E1@uk.lionbioscience.com> David Martin wrote: > > On Mon, 20 Aug 2001, Peter Rice wrote: > > For example, if your script is called 'remotegetz' just add this to the > > database definition: > > > > app: remotegetz > > This would then allow one to add whichever script one wanted as long > as it could parse srs style arguements.. > > It doesn't have to be SRS, just look like it.. The potential is there > for wrapping in house rdbms with such a script. > > I'll add some more comments to the admin guide if Peter can send me > details of how EMBOSS calls the wgetz program (not being much of an srs > hacker myself). This is getz, not wgetz. It supports the full SRS query language because it calls getz (or a user defined script) with an SRS query constructed from the USA. But there is also an access method in general for external applications. 
You can use this to set up RDBMS calls - which anyway was the original intention. At present it picks up dbname:id or dbname:acc as the rest of the command line, or puts the id/accession into a formatted string (if the application definition includes %s), but can easily be adapted further. -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From cutler at tularik.com Mon Aug 27 14:34:15 2001 From: cutler at tularik.com (Gene Cutler) Date: Mon, 27 Aug 2001 11:34:15 -0700 Subject: drawing trees Message-ID: Hello, all. I have a question about phylogenetic-type trees for sequences. I haven't quite figured out how to do this using emboss/phylip. This is how I have been doing this with gcg: perform multiple sequence alignment (generally hmmalign or clustalw) convert file to msf format if not already (e.g., sreformat from hmmer package) run gcg program distances on the msf file run gcg program growtree on the distances file end up with a postscript file How would I do this with PHYLIP instead? Thanks. From stein at fieldmuseum.org Mon Aug 27 15:20:00 2001 From: stein at fieldmuseum.org (Jennifer Steinbachs) Date: Mon, 27 Aug 2001 14:20:00 -0500 (CDT) Subject: drawing trees In-Reply-To: Message-ID: Use your favourite alignment program... Put your aligned sequences into PHYLIP format Run the appropriate phylip program... distance-based methods: protdist (for proteins) dnadist (for dna) parsimony protpars dnapars likelihood dnaml or dnamlk protml See the phylip website for more info (http://evolution.genetics.washington.edu/phylip.html). If you aren't certain of the differences between the different tree-building algorithms, you should get your hands on Nei and Kumar 2000 Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good reference for phylogenetics is Hillis et al 1996 Molecular Systematics ISBN 0-87893-282-8 Chapter 11. 
-jennifer On Mon, 27 Aug 2001, Gene Cutler wrote: >Hello, all. I have a question about phylogenetic-type trees for >sequences. I haven't quite figured out how to do this using >emboss/phylip. This is how I have been doing this with gcg: > >perform multiple sequence alignment (generally hmmalign or clustalw) >convert file to msf format if not already (e.g., sreformat from hmmer package) >run gcg program distances on the msf file >run gcg program growtree on the distances file >end up with a postscript file > >How would I do this with PHYLIP instead? > >Thanks. > > ----------------------------------- J. Steinbachs, Ph.D. Computational Biologist http://compbiology.org ----------------------------------- From cutler at tularik.com Mon Aug 27 16:19:56 2001 From: cutler at tularik.com (Gene Cutler) Date: Mon, 27 Aug 2001 13:19:56 -0700 Subject: drawing trees In-Reply-To: References: Message-ID: Thanks Jennifer. One more question: >Put your aligned sequences into PHYLIP format I didn't see any information on the phylip webpage about phylip format and/or conversion tools. I can do the conversion myself if I can find documentation on the format. >If you aren't certain of the differences between the different >tree-building algorithms, you should get your hands on Nei and Kumar 2000 >Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good >reference for phylogenetics is Hillis et al 1996 Molecular Systematics >ISBN 0-87893-282-8 Chapter 11. That's useful too. Thanks again. 
-- -=-=-=-=-=-=-=-=-=-=-=-=-=- Gene Cutler Bioinformatics Scientist cutler at tularik.com - - - - - - - - - - - - - Tularik Inc 2 Corporate Drive South San Francisco, CA 94080, USA http://www.tularik.com -=-=-=-=-=-=-=-=-=-=-=-=-=- From stein at fieldmuseum.org Mon Aug 27 17:09:55 2001 From: stein at fieldmuseum.org (Jennifer Steinbachs) Date: Mon, 27 Aug 2001 16:09:55 -0500 (CDT) Subject: drawing trees In-Reply-To: Message-ID: I like to use Seaview to make the conversion (I don't have the website handy but a google search should produce it quickly). ClustalX (and maybe clustalw) also produce Phylip files. I thought the Phylip website had information on the format, but it's been a while since I've actually perused the documentation. The phylip docs should definitely have complete information. If I recall correctly, it is something like: #sequences #nucleotide_sites sequence_name sequence sequence_name sequence etc. There used to be a 10 character limit on sequence_name, but I don't know if that holds with the latest version - I use PAUP* mostly for my analyses. Sequence can be non-interleaved or interleaved. -jennifer On Mon, 27 Aug 2001, Gene Cutler wrote: >Thanks Jennifer. One more question: > >>Put your aligned sequences into PHYLIP format > >I didn't see any information on the phylip webpage about phylip format >and/or conversion tools. I can do the conversion myself if I can find >documentation on the format. > >>If you aren't certain of the differences between the different >>tree-building algorithms, you should get your hands on Nei and Kumar 2000 >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics >>ISBN 0-87893-282-8 Chapter 11. > >That's useful too. Thanks again. > > > -- ----------------------------------- J. Steinbachs, Ph.D. 
Computational Biologist http://compbiology.org ----------------------------------- From jrvalverde at cnb.uam.es Tue Aug 28 02:05:15 2001 From: jrvalverde at cnb.uam.es (jrvalverde at cnb.uam.es) Date: Tue, 28 Aug 2001 08:05:15 +0200 (DST) Subject: drawing trees In-Reply-To: Message-ID: <200108280605.f7S65GE1348757@embnet.cnb.uam.es> Gene Cutler wrote: > Thanks Jennifer. One more question: > > >Put your aligned sequences into PHYLIP format > > I didn't see any information on the phylip webpage about phylip format > and/or conversion tools. I can do the conversion myself if I can find > documentation on the format. It's on the package documentation, but if you are already using CLUSTAL, then simply go to "multiple alignments", choose "output options" and then select "PHYLIP" format. As for the details, either look at the documentation ("main.doc") or to EMBnet's Quick Guide to PHYLIP (PDF and HTML versions may still be found at http://www.es.embnet.org/~pprpc/activs/PHYLIPGuide/PhylipGuide-1.6.html j From frank at bioss.ac.uk Tue Aug 28 03:45:33 2001 From: frank at bioss.ac.uk (Frank Wright) Date: Tue, 28 Aug 2001 08:45:33 +0100 Subject: drawing trees References: Message-ID: <3B8B4C1D.40315E3C@bioss.ac.uk> Hi Gene, >I didn't see any information on the phylip webpage about phylip format >and/or conversion tools. I can do the conversion myself if I can find >documentation on the format. PHYLIP FORMAT ------------- PHYLIP format is discussed in the PHYLIP documentation "main" file: http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html See section 6.Overview of the input and output formats/ subsection 1.Input File Format See the PHYLIP "sequences" documentation for details of how PHYLIP codes unknowns and gaps: http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html READSEQ ------- The READSEQ program is very useful for converting between alignment formats (you may have to edit "." to be "-" though as PHYLIP codes gaps differently). 
ftp://ftp.bio.indiana.edu/molbio/readseq/ To use: Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a There is an old version and a Java version. I prefer the old one :-). Other programs -------------- Jennifer Steinbachs has suggested some programs for reformatting alignments. Here's some additional comments: There are other programs that can reformat alignment files. CLUSTALW can be used to read in alignments (option 1) and write them out (option 2, suboption 9) without doing an alignment. I've not used SEQIO but it looks useful: http://bioweb.pasteur.fr/docs/seqio/seqio.html On a PC you could use "export" and "import" facilities in GENEDOC, an excellent alignment editor. http://www.psc.edu/biomed/genedoc/ Best Wishes, Frank -- Frank Wright Biomathematics and Statistics Scotland, SCRI, DUNDEE DD2 5DA, Scotland frank at bioss.sari.ac.uk From letondal at pasteur.fr Tue Aug 28 03:51:50 2001 From: letondal at pasteur.fr (Catherine Letondal) Date: Tue, 28 Aug 2001 09:51:50 +0200 Subject: drawing trees In-Reply-To: Your message of "Tue, 28 Aug 2001 08:45:33 BST." <3B8B4C1D.40315E3C@bioss.ac.uk> Message-ID: <200108280751.f7S7poM220418@electre.pasteur.fr> Frank Wright wrote: > Hi Gene, > > >I didn't see any information on the phylip webpage about phylip format > >and/or conversion tools. I can do the conversion myself if I can find > >documentation on the format. Our Phylip Web server (http://bioweb/seqanal/phylogeny/phylip-uk.html) may help with format conversion as well as phylogenetic programs chaining. 
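For anyone doing the conversion by hand, here is a minimal Python sketch of the sequential layout recalled earlier in the thread: a first line with the number of sequences and the number of sites, then one line per sequence with the name followed by the residues. The 10-character name field and the replacement of "." gaps by "-" follow the remarks made in this thread; the function name to_phylip is made up for the example, and the PHYLIP documentation remains the authority on the format.

```python
def to_phylip(records, name_width=10):
    """Format (name, aligned_sequence) pairs as sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences are not aligned to the same length")

    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        # Names are cut or padded to a fixed-width field.
        field = f"{name[:name_width]:<{name_width}}"
        # Gaps written as "." by other tools become "-".
        lines.append(field + seq.replace(".", "-"))
    return "\n".join(lines) + "\n"
```

Interleaved output is not covered; this only writes the non-interleaved form.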
> > PHYLIP FORMAT > ------------- > > PHYLIP format is discussed in the PHYLIP documentation "main" file: > > http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html > > See section 6.Overview of the input and output formats/ subsection > 1.Input File Format > See the PHYLIP "sequences" documentation for details of how PHYLIP codes > unknowns and gaps: > > http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html > > READSEQ > ------- > The READSEQ program is very useful for converting between alignment > formats (you may have to edit "." to be "-" though as PHYLIP codes gaps > differently). > > ftp://ftp.bio.indiana.edu/molbio/readseq/ > > To use: > > Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a > > There is an old version and a Java version. I prefer the old one :-). > > > Other programs > -------------- > > Jennifer Steinbachs has suggested some programs for reformatting > alignments. Here's some additional comments: > > There are other programs that can reformat alignment files. CLUSTALW > can be used to read in alignments (option 1) and write them out (option > 2, suboption 9) without doing an alignment. I've not used SEQIO but it > looks useful: > > http://bioweb.pasteur.fr/docs/seqio/seqio.html > > On a PC you could use "export" and "import" facilities in GENEDOC, an > excellent alignment editor. > > http://www.psc.edu/biomed/genedoc/ > > > > Best Wishes, > Frank > -- > Frank Wright > Biomathematics and Statistics Scotland, > SCRI, DUNDEE DD2 5DA, Scotland > frank at bioss.sari.ac.uk -- Catherine Letondal -- Pasteur Institute Computing Center From frank at bioss.ac.uk Tue Aug 28 04:04:13 2001 From: frank at bioss.ac.uk (Frank Wright) Date: Tue, 28 Aug 2001 09:04:13 +0100 Subject: drawing trees References: Message-ID: <3B8B507D.B7639FE5@bioss.ac.uk> Hi All, Gene Cutler asked: >Hello, all. I have a question about phylogenetic-type trees for >sequences. I haven't quite figured out how to do this using >emboss/phylip. 
This is how I have been doing this with gcg: > >run gcg program distances on the msf file >run gcg program growtree on the distances file > >How would I do this with PHYLIP instead? The GCG DISTANCES program and GCG GROWTREE programs are very similar to the DNADIST/PROTDIST and Neighbor programs in PHYLIP. In other words, they allow phylogenetic trees to be constructed using "distance-based" methods, but do not allow maximum likelihood or parsimony methods to be used. They also don't do bootstrapping tests, tree comparisons, and lots of other things. If you are using distance-based phylogenetic methods, some notes: (1) Weighted least-squares is slower but more accurate than Neighbor-Joining, so use the PHYLIP FITCH program instead of NEIGHBOR. (2) Recently, the PHYLIP-like WEIGHBOR program (Weighted Neighbor-Joining) has been released. WEIGHBOR appears to be an improvement on Neighbor-Joining (and possibly weighted least squares). See http://www.t10.lanl.gov/billb/weighbor/. I've not tried it out much but the simulations in the published paper look convincing. (3) PHYLIP (version 3.6) has improved DNADIST (more distance methods) and PROTDIST (rate heterogeneity among sites added). Best Wishes, Frank -- Frank Wright Biomathematics and Statistics Scotland, SCRI, DUNDEE DD2 5DA, Scotland frank at bioss.sari.ac.uk From dmartin at bioinformatics.msiwtb.dundee.ac.uk Tue Aug 28 04:26:00 2001 From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin) Date: Tue, 28 Aug 2001 09:26:00 +0100 (BST) Subject: drawing trees In-Reply-To: Message-ID: Remember that in the Phylip programs distributed as an EMBASSY package, EMBOSS will do the sequence conversions for you. The programs to run are the same name with an e prepended. It also has the various options in ACD format so the programs can be fully scripted. The two programs that haven't been EMBOSSised are DRAWTREE and DRAWGRAM. 
..d On Mon, 27 Aug 2001, Jennifer Steinbachs wrote: > > I like to use Seaview to make the conversion (I don't have the website > handy but a google search should produce it quickly). ClustalX (and maybe > clustalw) also produce Phylip files. I thought the Phylip website had > information on the format, but it's been a while since I've actually > perused the documentation. The phylip docs should definitely have > complete information. > > If I recall correctly, it is something like: > > #sequences #nucleotide_sites > sequence_name sequence > sequence_name sequence > > etc. > > There used to be a 10 character limit on sequence_name, but I don't know > if that holds with the latest version - I use PAUP* mostly for my > analyses. Sequence can be non-interleaved or interleaved. > > -jennifer > > On Mon, 27 Aug 2001, Gene Cutler wrote: > > >Thanks Jennifer. One more question: > > > >>Put your aligned sequences into PHYLIP format > > > >I didn't see any information on the phylip webpage about phylip format > >and/or conversion tools. I can do the conversion myself if I can find > >documentation on the format. > > > >>If you aren't certain of the differences between the different > >>tree-building algorithms, you should get your hands on Nei and Kumar 2000 > >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good > >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics > >>ISBN 0-87893-282-8 Chapter 11. > > > >That's useful too. Thanks again. > > > > > > > > ---------------------------------- David Martin PhD Bioinformatics Scientific Officer Wellcome Trust Biocentre, Dundee ---------------------------------- From pscotney at hotmail.com Sat Aug 25 09:55:16 2001 From: pscotney at hotmail.com (Pierre Scotney) Date: Sat, 25 Aug 2001 23:55:16 +1000 Subject: [EMBOSS] EMBOSS and Jemboss installation problems SOLVED! Message-ID: Hello! 
I have solved the GNU/Linux EMBOSS and Jemboss installation problems :) The solution was: 1) edit both /etc/profile (for bash) and /etc/csh.login (for csh) so that $PATH includes /usr/local/lib/j2sdk1.4.2/bin path, previously only bash had the correct path to the java binaries. 2) use j2sdk1.4.2 as EMBOSS-2.8.0 will not build with j2sdk1.3.1 (Blackdown Java-Linux). May be the documentation/scripts will need to be changed to reflect this issue. Cheers Pierre -- Dr Pierre Scotney Melbourne Australia _________________________________________________________________ Get Extra Storage in 10MB, 25MB, 50MB and 100MB options now! Go to http://join.msn.com/?pgmarket=en-au&page=hotmail/es2 From g38909015 at mailsrv.ym.edu.tw Wed Aug 1 05:06:37 2001 From: g38909015 at mailsrv.ym.edu.tw (TerryYeh-YM) Date: Wed, 1 Aug 2001 13:06:37 +0800 Subject: Join mailing list Message-ID: <000a01c11a47$be86f3c0$46146e8c@nchc.gov.tw> ------------------------------------------------------ Chang-Wei Yeh (Terry Yeh) National Yang Ming University College of Life Science Institute of Anatomy and Cell Biology Bioinformatics Program and Core Lab ------------------------------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.andrews at bbsrc.ac.uk Wed Aug 1 12:56:08 2001 From: simon.andrews at bbsrc.ac.uk (simon andrews (BI)) Date: Wed, 1 Aug 2001 13:56:08 +0100 Subject: [EMBOSS] Getting headers from Seqret Message-ID: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk> [sent to Emboss mailing list] Dear All, I'm having trouble getting header information back through seqret, from a database formatted using dbiflat against a genbank flat file (refseq actually). I'm sure plenty of people must have done this before, but I've read through the documentation, and I can't see where I'm going wrong! 
The database formatted OK, and I can fetch sequences back from it, but at some point I will need to retrieve the entire header from the original file to get at some of the extra information in there (feature tables, cross references, authors etc). I've tried several different output USAs with seqret, but the most I can seem to get back is the name, accession number and description. I can't believe that this information is thrown away by seqret (it's still there in the flat file after all), so how can I retrieve it? Thanks for any help Simon [Potentially useful details follow] ---- Simon Andrews PhD Bioinformatics Dept The Babraham Institute simon.andrews at bbsrc.ac.uk +44 (0)1223 496463 ########################################################################## Emboss version = 2.0.0 Platform = DEC alpha (OSF1 v4.0) My emboss.default entry for the database looks like; DB refseq [ type: N method: emblcd format: gb dir: /usr/users/andrewss/Refseq/Genbank file: "*.gbff" release: "1.0" comment: "Refseq Hum Mus Rat" ] and an example of the output of seqret with a debug USA is (with the documentation space suspiciously blank!); Sequence output trace ===================== Name: 'NM_031360' Accession: 'NM_031360' Description: 'Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA.' Type: 'N' Database: 'refseq' Full name: '' Date: '' Usa: 'debug::test.seq' Ufo: '' Input format: 'gb' Output format: 'debug' Filename: 'test.seq' Entryname: 'NM_031360' File name: 'test.seq' Extension: 'fasta' Single: 'No' Features: 'No' Count: 'No' Documentation:... 1 atgaagcaca acttttctct gcggctgagg gttttcaacc tcaactgctg 50 51 ggacatcccc tacctaagca agcatagggc cgaccgcatg aagcgcttgg 100 etc. The extra stuff I'm after is this sort of thing; LOCUS NM_031360 1269 bp mRNA ROD 12-JUN-2001 DEFINITION Rattus norvegicus neutral sphingomyelinase (Smpd2), mRNA. ACCESSION NM_031360 VERSION NM_031360.1 GI:14389300 KEYWORDS . SOURCE Norway rat. 
ORGANISM Rattus norvegicus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus. REFERENCE 1 (sites) AUTHORS Mizutani,Y., Tamiya-Koizumi,K., Irie,F., Hirabayashi,Y., Miwa,M. and Yoshida,S. TITLE Cloning and expression of rat neutral sphingomyelinase: enzymological characterization and identification of essential histidine residues JOURNAL Biochim. Biophys. Acta 1485 (2-3), 236-246 (2000) MEDLINE 20292884 COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from AB047002.1. FEATURES Location/Qualifiers source 1..1269 /organism="Rattus norvegicus" /strain="Sprague-Dawley" /db_xref="taxon:10116" /chromosome="X" /chromosome="14" /chromosome="2" /chromosome="3" /chromosome="17" /map="Xq28" /map="14q" /map="2 36.0 cM" /map="Xq11.1" /map="3" /map="17q12-q21" /sex="male" /tissue_type="liver" /clone_lib="rat liver lambda cDNA library (STRATAGENE,#936513)" gene 1..1269 /gene="Smpd2" /note="EBS3; EBS4; K14; CK; MAGE5; MAGE10; Tdo; Araf" /db_xref="LocusID:83537" /db_xref="MGD:MGI:98246" /db_xref="MIM:148066" /db_xref="MIM:300340" /db_xref="MIM:300343" /db_xref="MIM:601443" /db_xref="RATMAP:36372" /db_xref="RGD:36372" CDS 1..1269 /gene="Smpd2" /note="lyso-platelet activating factor-phospholipase C; cytokeratin 14; Raf related protein; Synaptosomal-associated protein" /codon_start=1 /db_xref="LocusID:83537" /db_xref="MGD:MGI:98246" /db_xref="MIM:148066" /db_xref="MIM:300340" /db_xref="MIM:300343" /db_xref="MIM:601443" /db_xref="RATMAP:36372" /db_xref="RGD:36372" /product="neutral sphingomyelinase" /protein_id="NP_112650.1" /db_xref="GI:14389301" /translation="MKHNFSLRLRVFNLNCWDIPYLSKHRADRMKRLGDFLNLESFDL ALLEEVWSEQDFQYLKQKLSLTYPDAHYFRSGIIGSGLCVFSRHPIQEIVQHVYTLNG YPYKFYHGDWFCGKAVGLLVLHLSGLVLNAYVTHLHAEYSRQKDIYFAHRVAQAWELA QFIHHTSKKANVVLLCGDLNMHPKDLGCCLLKEWTGLRDAFVETEDFKGSEDGCTMVP 
KNCYVSQQDLGPFPFGVRIDYVLYKAVSGFHICCKTLKTTTGCDPHNGTPFSDHEALM ATLCVKHSPPQEDPCSAHGSAERSALISALREARTELGRGIAQARWWAALFGYVMILG LSLLVLLCVLAAGEEAREVAIMLWTPSVGLVLGAGAVYLFHKQEAKSLCRAQAEIQHV LTRTTETQDLGSEPHPTHCRQQEADRAEEK" misc_feature 91..837 /note="AP_endonucleas1; Region: AP endonuclease family 1" From peter.rice at uk.lionbioscience.com Wed Aug 1 13:12:57 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Wed, 01 Aug 2001 14:12:57 +0100 Subject: [EMBOSS] Getting headers from Seqret References: <2DC41140A89ED411989D00508BDCD9EDEA4FF2@bi-exsrv1.iapc.bbsrc.ac.uk> Message-ID: <3B680059.74B4594@uk.lionbioscience.com> "simon andrews (BI)" wrote: > The database formatted OK, and I can fetch sequences back from it, but > at some point I will need to retrieve the entire header from the > original file to get at some of the extra information in there > (feature tables, cross references, authors etc). > > I've tried several different output USAs with > seqret, but the most I can seem to get back is the name, accession number > and description. It all depends on how much information we store in the internal data structures. As standard, we keep the ID, Accession, Description and sequence so we can write a FASTA format file easily. We also keep the complete feature table, but only optionally. seqret ignores it, but seqretallfeat reads and writes it. Most programs only need the sequence data and parsing feature information wastes time and space on large sequences. We can also read the entire text of an entry with entret, assuming you want the original flatfile format. >I can't believe that this information is thrown away by seqret > (it's still there in the flat file after all), Yes, it is (but we can easily read more fields - the problem is whether we can convert them to other file formats easily) > so how can I retrieve it? Using entret - which sounds like the solution you need. 
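Once the full entry text is in hand, the header fields can be pulled out with very little code. The Python sketch below is only an illustration and assumes the usual GenBank flat-file layout, with keywords in the first 12 columns and data starting in column 13; the function name genbank_header is invented for the example. It stops at the feature table and folds continuation lines into the preceding field.

```python
def genbank_header(entry_text):
    """Return the header of a GenBank entry as (keyword, value) pairs."""
    fields = []
    for line in entry_text.splitlines():
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword in ("FEATURES", "ORIGIN", "//"):
            break  # the header ends where the feature table starts
        if keyword:
            # LOCUS, DEFINITION, ... and sub-keywords such as ORGANISM
            fields.append((keyword, value))
        elif fields and value:
            # Continuation line: append to the previous field.
            prev_key, prev_val = fields[-1]
            fields[-1] = (prev_key, f"{prev_val} {value}")
    return fields
```

A list of pairs is used rather than a dictionary because keywords such as REFERENCE can occur more than once in an entry.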
-- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From ableasby at hgmp.mrc.ac.uk Wed Aug 1 17:15:25 2001 From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk) Date: Wed, 1 Aug 2001 18:15:25 +0100 (BST) Subject: EMBOSS patchfiles directory Message-ID: <200108011715.SAA26106@bromine.hgmp.mrc.ac.uk> Just a reminder that, between EMBOSS releases, occasional bugfixes are placed in the directory: ftp://ftp.uk.embnet.org/pub/EMBOSS/patchfiles/ There are currently two replacement files in that directory. marscan.c showfeat.c Both are replacements for applications in the EMBOSS-2.0.1/emboss directory. Alan From gbottu at ben.vub.ac.be Thu Aug 2 17:00:02 2001 From: gbottu at ben.vub.ac.be (Guy Bottu) Date: Thu, 2 Aug 2001 19:00:02 +0200 (MET DST) Subject: databanks in PIR format Message-ID: <200108021700.TAA24275@bigben.vub.ac.be> from : BEN Dear colleagues, Has anybody already successfully accessed databanks in PIR NBRF or CODATA format under EMBOSS ? I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is indexed under SRS). My emboss.default file contains : DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant' methodquery: srs dbalias: PIR_NR methodall: direct dir: /seq/protein/flat file: pir_nr.seq ] But this does not work. E.g. 
seqret pir_nr:e69549 gives an output file : >E69549 conserved hypothetical protein AF2396 - Archaeoglobus fulgidus >E69549 MTVVPLSALREGQEGRVVAINGGRGCTARLMSMGIVPGKKIRIAGRRGGAVLVSVNGTKF VIGRGLAMKVAVDVGEQG Guy Bottu From peter.rice at uk.lionbioscience.com Thu Aug 2 17:28:29 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Thu, 02 Aug 2001 18:28:29 +0100 Subject: databanks in PIR format References: <200108021700.TAA24275@bigben.vub.ac.be> Message-ID: <3B698DBD.9984ED23@uk.lionbioscience.com> Guy Bottu wrote: > I have EMBOSS 2.0.0 and a databank in PIR format (the version in NBRF format is > indexed under SRS). My emboss.default file contains : > > DB pir_nr [ type: P format: nbrf comment: 'PIR nonredundant' > methodquery: srs dbalias: PIR_NR > methodall: direct dir: /seq/protein/flat file: pir_nr.seq > ] > > But this does not work. E.g. seqret pir_nr:e69549 gives an output file This is because of problems in SRS converting PIR entries to PIR format. This has been the same since the days of SRS 5, but I have passed it on to the support guys here to take a look. Seems nobody has been retrieving PIR entries in their original format. For example, see PIR on the SRS 5 server at MIPS: http://srs-mips.gsf.de/srs5bin/cgi-bin/wgetz?-id+2trYB1GreRI+-e+[PIR-ID:'E69549'] You can get queries to work with: DB pir_nr [ type: P format: fasta comment: 'PIR nonredundant' methodquery: srsfasta dbalias: PIR_NR methodall: direct dir: /seq/protein/flat file: pir_nr.seq ] ... but the fasta format required for srsfasta will not let you work with direct access to all entries. 
srs access does getz -e srsfasta access does getz -d -sf fasta regards, Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From peter.rice at uk.lionbioscience.com Thu Aug 2 17:59:20 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Thu, 02 Aug 2001 18:59:20 +0100 Subject: databanks in PIR format References: <200108021700.TAA24275@bigben.vub.ac.be> <3B698DBD.9984ED23@uk.lionbioscience.com> Message-ID: <3B6994F8.F4A5A403@uk.lionbioscience.com> >This is because of problems in SRS converting PIR entries to PIR format. >This has been the same since the days of SRS 5, but I have passed it on to >the support guys here to take a look. Quick fix would be to change the format in pir.i to be "plain" and run srssection. This gives PIR format without the trailing * but is good enough to make EMBOSS happy. Then Guy's original definition should work. regards, Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From gbottu at ben.vub.ac.be Fri Aug 3 09:43:57 2001 From: gbottu at ben.vub.ac.be (Guy Bottu) Date: Fri, 3 Aug 2001 11:43:57 +0200 (MET DST) Subject: databanks in PIR format Message-ID: <200108030943.LAA11981@bigben.vub.ac.be> >Quick fix would be to change the format in pir.i to be "plain" and run >srssection. > >This gives PIR format without the trailing * but is good enough to make >EMBOSS happy. Then Guy's original definition should work. > I tried and it worked ! Thanks for the advice. Still, there must be some nasty bug hidden in the SRS code, since similar problem does not occur with EMBL and GenBank formats. Let's hope they can fix it. 
Guy Bottu

From gbottu at ben.vub.ac.be Fri Aug 3 12:48:02 2001
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 3 Aug 2001 14:48:02 +0200 (MET DST)
Subject: problem with remote databank access
Message-ID: <200108031248.OAA26689@bigben.vub.ac.be>

from : BEN

Dear support,

While experimenting with remote databank access I noticed the following :

DB GENBANK [ type: N format: genbank method: url
   comment: 'GenBank at Institut Pasteur (Paris, France)'
   url: "http://srs.pasteur.fr/cgi-bin/srs6/wgetz?-e+[genbank-acc:%s]"
]

does work fine. However, with :

DB GENBANK [ type: N format: genbank method: url
   comment: 'GenBank at DKFZ (Heidelberg, Germany)'
   url: "http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+[genbank-acc:%s]"
]

seqret genbank:X15320 retrieves a file :

>ECARGS X15320 Escherichia coli argS gene for arginyl-tRNA-synthetase (EC 6.1.1.19

The problem is probably that at the DKFZ they index the databank in GCG format. However, replacing "format: genbank" by "format: gcg" does not work.

Guy Bottu

From jackl at dalicon.com Fri Aug 3 13:11:41 2001
From: jackl at dalicon.com (Jack Leunissen)
Date: Fri, 3 Aug 2001 15:11:41 +0200
Subject: problem with remote databank access
References: <200108031248.OAA26689@bigben.vub.ac.be>
Message-ID: <009001c11c1d$d74aaff0$0400a8c0@cmbipc32>

No, the problem is that their default output format is EMBL! And that seems to upset EMBOSS, as it expects GENBANK format for the sequence information too. Changing the call to:

   url: "http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+-sf+genbank+[genbank-acc:%s]"

does the trick! (Note the addition "+-sf+genbank" to force the sequence output in GENBANK format.)

Cheers,
Jack

Jack A.M.
Leunissen                    Email: jackl at cmbi.kun.nl
Centre for Molecular and     Tel  : +31 24 365 22 48
Biomolecular Informatics     Fax  : +31 24 365 29 77
Nijmegen, Netherlands        http://www.cmbi.kun.nl/

----- Original Message -----
From: "Guy Bottu"
To:
Cc: ;
Sent: Friday, August 03, 2001 2:48 PM
Subject: problem with remote databank access

> from : BEN
>
> Dear support,
>
> While experimenting with remote databank access I noticed the following :
>
> DB GENBANK [ type: N format: genbank method: url
>    comment: 'GenBank at Institut Pasteur (Paris, France)'
>    url: "http://srs.pasteur.fr/cgi-bin/srs6/wgetz?-e+[genbank-acc:%s]"
> ]
>
> does work fine. However, with :
>
> DB GENBANK [ type: N format: genbank method: url
>    comment: 'GenBank at DKFZ (Heidelberg, Germany)'
>    url: "http://genius.embnet.dkfz-heidelberg.de/menu/cgi-bin/srs/wgetz?-e+[genbank-acc:%s]"
> ]
>
> seqret genbank:X15320 retrieves a file :
>
> >ECARGS X15320 Escherichia coli argS gene for arginyl-tRNA-synthetase (EC 6.1.1.19
>
> The problem is probably that at the DKFZ they index the databank in GCG format.
> However, replacing "format: genbank" by "format: gcg" does not work.
>
> Guy Bottu
>
>

From peter.rice at uk.lionbioscience.com Fri Aug 3 15:07:39 2001
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Fri, 03 Aug 2001 16:07:39 +0100
Subject: databanks in PIR format
References: <200108030943.LAA11981@bigben.vub.ac.be>
Message-ID: <3B6ABE3B.5EBE5C2C@uk.lionbioscience.com>

Guy Bottu wrote:
> >Quick fix would be to change the format in pir.i to be "plain" and run
> >srssection.
>
> I tried and it worked ! Thanks for the advice.
>
> Still, there must be some nasty bug hidden in the SRS code, since similar
> problem does not occur with EMBL and GenBank formats. Let's hope they
> can fix it.

"It's not a bug, it's a feature"

As it has been there since SRS 5.0 (at least) and requires changes to the C source code (so that PIR format behaves the same way as EMBL), it will have to wait for a future release.
Meanwhile, the plain fix will work well enough - some software may want a trailing '*' but probably most programs will be happy. Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From dalke at dalkescientific.com Mon Aug 6 00:52:59 2001 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 6 Aug 2001 01:52:59 +0100 Subject: questions about ACD format Message-ID: <005101c11e12$470a8180$0201a8c0@josiah.dalkescientific.com> [Brief summary: I'm trying to integrate Emboss with Biopython and found that 1) not enough sequence type information is available in the ACD file for Biopython's AlphabetStrict code to work, so I have a proposal to fix the, 2) I have questions about how to interpret some of the documentation, 3) there are places where the Emboss ACD parser doesn't appear to work correctly, and 4) general observations on the ACD format and on the implementation.] Hello, First off, my apologies if this is the wrong email address for this topic. I couldn't find any archives to scan for verification. I am also not a member of this list, so please cc me on any replies. Based on the feedback I got from some people at ISMB, I've started a Python interface to EMBOSS. The goal is to be able to do something like: >>> from Bio import Seq >>> from Bio.Alphabet import IUPAC >>> from Bio.Emboss import apps >>> >>> seq = Seq.Seq("AATCCATCGATGCAC", IUPAC.unambiguous_dna) >>> results = apps.revseq(sequence = seq) >>> results["outseq"] Emboss.EmbossSeq("GTGCATCGATGGATT", IUPAC.ambiguous_dna) >>> I can almost, but not quite do this, for some reasons I'll describe shortly. Here are the questions and problems I had in doing this, as well as some specific feature I would like to see added, which I feel may make it easier to integrate EMBOSS with other systems. ====== ** Topic 1 As you can see in the above example, there is some automatic conversion going on. 
One is to convert the Biopython 'Seq' object to a temporary file, so it can be used with the '-sequence' parameter needed by revseq. This is done by knowing how to convert the Seq object to a 'seqall' Emboss type, including looking at the 'type' field to ensure that the input sequence is really DNA. The conversion step requires that I do a verification of the Biopython Seq Alphabet to the Emboss sequence 'type'. There is a description of the types in the syntax document, at http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html but it doesn't describe: 1) what is used as a gap character? (I assume '-') 2) what is used for a stop character? (I assume '*') 3) are selenocysteines encoded with a U? (the pureprotein definition says it excludes "BZ or X", so I'm guessing selenocysteines aren't allowed - or are they encoded as X?) 4) shouldn't there be a gapstopprotein? ** Topic 2 Another conversion is to create a temporary filename for the -outseq parameter, based on the 'seqoutall' Emboss type. I would like to read the contents of this file into a Biopython Seq object, however, the ACD description does not contain enough information for me to do that. Instead, I can only create the tempfile and store the filename in the "outseq" parameter. Could a new 'type' parameter be added to 'seqoutall'? This would change revseq's "outseq" definition to be seqoutall: outseq [ parameter: "Y" type: "dna" extension: "rev" ] For applications like 'notseq' this would require using an operation: seqoutall: outseq [ param: Y type: "@($(sequence.protein)? protein : nucleic)" ] The goal of this is to let researchers use EMBOSS from Python without having to worry about an implementation detail - the existance of a file system. (BTW, I have heard there may be support in 3.0 for XML output and the ability for all the output to be streamed to stdout. I didn't find any details about this on my scan of the web pages - what is the status and plans for this?) 
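The temporary-file round trip described in Topics 1 and 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses just the two revseq qualifiers named above (-sequence and -outseq), assumes revseq accepts a FASTA input file and does not prompt for anything else, and the helper names build_revseq_command and run_revseq are hypothetical, not part of Biopython or EMBOSS.

```python
import os
import subprocess
import tempfile


def build_revseq_command(in_path, out_path):
    """Build the argv list using only the qualifiers named in the message."""
    return ["revseq", "-sequence", in_path, "-outseq", out_path]


def run_revseq(name, residues):
    """Hypothetical helper: hide the file system from the caller.

    Writes the sequence as FASTA into a private temporary directory,
    runs revseq on it, and returns the raw text of the output file.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        in_path = os.path.join(tmpdir, "input.fasta")
        out_path = os.path.join(tmpdir, "output.rev")
        with open(in_path, "w") as handle:
            handle.write(">%s\n%s\n" % (name, residues))
        # check=True raises CalledProcessError if revseq exits non-zero.
        subprocess.run(build_revseq_command(in_path, out_path), check=True)
        with open(out_path) as handle:
            return handle.read()
```

The returned text still has to be turned into a sequence object, and without a 'type' on seqoutall the caller cannot tell which alphabet to attach to it, which is the gap Topic 2 describes.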
** Topic 3: The Emboss sequence data type contains the calculated attributes of 'protein', 'nucleic' and 'type'. Is it that: - protein is true when the sequence type is 'protein', 'gapprotein', 'pureprotein' or 'stopprotein' - nucleic is true when the sequence type is 'dna', 'rna', 'puredna', 'purerna', 'nucleotide', 'purenucleotide' 'gapnucleotide', 'gapdna', 'gaprna', - protein and nucleic are false for any other case ? ** Topic 4: How do I force a sequence type? The -sprotein and -snucleotide command-line qualifiers are only boolean values, so there doesn't seem to be any way to say an input is really a pureprotein. Eg, there could be a '-stype' qualifier, so I can do '-stype pureprotein'. ** Topic 5: Given the existence of sequence.type, shouldn't most operations of the form "@($(sequence.protein)? protein : nucleic)" really be "$(sequence.type)" ? This should allow better propogation of proper type information through Emboss. == Okay, that's the sequence type related topic. Now for some others, first on parsing ACD files. To get the parameter information I read the ACD files. There are actually two possible files to read: the ".acd" file and the file produced from the "-acdpretty" option. ** Topic 6: Which is the prefered mechanism for getting ACD configuration information? There are advantages and disadvantages to either one. - The .acd file does not require executing a possibly arbitrary program to get its parameter information. This can be a subtle security problem because the mechanism I'm using just does a system() call to see if the program exists, and has no qualms in running "rm-rf / && echo", which expands to the valid command "rm -rf / && echo -acdpretty". By checking the acd file first, it eliminates that possibility, although it does require that the directory containing the .acd definitions be well-known. Is this well-known directory $EMBOSS_ACDROOT or is that a 1.x location? 
(The other possibility is to require that all Emboss executables and only Emboss executables be in a well-defined directory. Looking at the standard 'configure', this is not usually the case - they get put into /usr/local/bin )

- a problem with using the .acd file is that it may be out of synch with the actual executable

- the -acdpretty option is problematical in that it writes its information to a file in the local directory. My Python code cannot guarantee that the local directory is writeable, so I need to mkdir a temp directory then "cd $(tmpdir) && $(program) -acdpretty" then read "$(tmpdir)/$(program).acdpretty" then remove the directory. It would be so much easier if the -acdpretty option could write to stdout. (Eg, as when used as '-acdpretty -stdout')

- the .acd file may use abbreviated names. For example, it may have a qualifier as "param" instead of "parameter". So the -acdpretty text is easier to parse.

I would prefer getting the ACD data directly from the executable. Is it possible to allow -stdout as an option to -acdpretty to make it dump to stdout? The other issues I can work around.

** Topic 7:

The ACD syntax definition is incomplete. Here are some problems I ran across.

> Comments start with "#" and continue to the end of the line.

Must the '#' be in the first character position? The function ajacd.c:acdNoComment looks like it truncates the line at the first '#', no matter where it is in the string, so the '#' doesn't need to be the first character. On the other hand, it looks like that bit of code doesn't understand quoted strings. Consider

% cat foo.acd
appl: foo [
  doc: "Who is #1?"
  groups: "Edit"
]
% ../acdc foo
Who is groups: "Edit
%

> Each line is parsed into tokens delimited by spaces

What is the definition of a token? We also have that

> Parameters and qualifiers are defined by a single token followed by
> either a colon ':' (preferred) [1] or an equal sign '=' which in
> turn is followed by a second token.
This means a token cannot end in a ':' or a '='. But it can contain a ':' outside of quotes, as in opt: @($(showall)?N:Y) Or consider % cat foo.acd appl: foo [ doc: A: ] % ../acdc foo A: % This means the ':' is not part of the first token in a parameter/qualifier but is part of the second token. Spaces aren't really the token delimiter. The file 'wordcount.acd' contains sequence: sequence [ param: Y type: dna] so the token 'dna' is not space delimited before the ']'. Also, checktrans.acd uses 'min:1' which is not space delimited. I'm trying to figure out how ajacd.c does it, but I'm getting lost in the code. To make thing even more confusing % cat foo.acd appl: f"oo [ doc:A]B ] % ../acdc foo A]B % Also, the term 'space' in the documentation should be 'whitespace' since it can skip '\t' characters. Hmm, and looking at the code, there's problems with how it skips the ':' characters. % cat foo.acd appl:::: foo [ doc: "This is the doc." ] % ../acdc foo This is the doc. % And using a NUL character % od -c foo.acd 0000000 a p p l : f o o [ \n 0000020 d o c \0 : " H a s a 0000040 N U L c h a r a c t e r " \n 0000060 S t r a n g e \n 0000100 ] \n 0000102 % ../acdc foo Strange % So the parser code does not fully validate that the input data is in the correct format. > After the name, definitions are in mandatory square brackets, [], > which can make a definition span multiple lines. seqretallfeat.acd contains the following two lines endsection: secoutseq endsection: secinseq which don't have the []. My parse ends up special casing the 'endsection' declarations. Would it be possible to use, say, endsection: secoutseq [] instead? (Also, section and endsection are not defined anywhere in that syntax document.) > Tokens representing data types can be abbreviated up to the point > where they are not ambiguous That's a VMS-help-style shortcut. As I recall, that has a forward-compatibility problem. 
For example, if a new data type called 'apple' is added, then 'a', 'ap', 'app', and 'appl' are no longer unambiguous. Has there been any consideration on how to deal with that? > Values can be delimited (i.e. treated as one token) by any of the > following pairs, which are stripped as the value is parsed : > > '' {} () [] <> It's not clear what a "value" means? In this section there is token: token [ definition ] But later on this the word 'attributes' is used instead of 'definition': data_type: parameter_name attributes ] and only then does it say what a value is: > A defining attribute must have a second token representing the value > of the attribute. So perhaps there should be some cleanup of the definition. (The reason I needed to figure this out was to check that appl: foo [ "multiword attribute": N, ] was indeed supposed to be illegal.) There doesn't appear to be any way to escape a quote character inside of a quoted token. At least, not that I could see in the code. So there's no way to write something like appl: foo [ doc: "Remove the characters ""{}<>()'" ] for the string Remove the characters "{}<>()' Also, the doc says the valid characters are '' {} () [] <> but that should include "double quotes" And just why are there so many quote characters? ** Topic 8: It took me a while to figure out that ajacd.c did the ACD parsing. The file ajnam.c parses the .embosssrc and emboss.defaults which is described in http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Usa/databases.html and is *almost* in ACD format. The difference is that it doesn't have an 'application' term and the 'DB' needs to be 'DB:'. I can tweak my parser to handle the 'DB' term, but why can't those two files really be in ACD format? ... Although implementation-wise the ajacd.c file uses static variables so it can only be used to parse one file. I noticed a couple problems with how ajname.c works. 
- It only understands a comment as a '#' in the first character position (while ajacd.c recognizes it anywhere).

- The code uses "fgets(line, 512, file)" which looks like it can fail if the line is more than 511 characters, as with long file names.

(Actually, since this is a completely different implementation of the parser, the failure conditions are different. For example, there is a namNoColon in ajnam.c, but nothing to strip a '='.)

** Topic 9

There needs to be some clarification on the license. When I looked at the code I read the top-level "COPYING" file, which is the GPL. I have a policy not to look at GPL'ed code too closely, since I worry that it may contaminate my ability to write equivalent non-GPL code, like the BSDed Biopython code. LGPL is not quite as bad, but even then I write the non-FSF-licensed code first then if needed for verification I look at the LGPL'ed code.

Since the top-level COPYING file is the GPL, that put me off looking at any of the source code, even for verifying format requirements. It had to be pointed out to me that the ajax and nucleus codes are covered under the LGPL. I would not have discovered that on my own, because the multiple-license use wasn't mentioned in the README.

In addition, I noticed the ./LICENSE is slightly different than the current Version 2, June 1991 one from the FSF. The FSF address is wrong, and there are formatting changes. I cannot tell if there are any text changes.

I also noticed the ./COPYING file is the GPL, except for a change in the address and the exclusion of the section "How to Apply These Terms to Your New Programs". Shouldn't these be identical files, and match the current FSF GPL?

** Topic 10

What does the 'warnrange' attribute of an integer do? (I've only lightly scanned the table of data types so will likely have more questions about the other fields in the future.)
** Topic 11

In scanning the code I noticed there is an indirection layer, which I assume is to isolate the programmer from changes in the OS and C library. It isn't used everywhere. For example, there's an ajNamGetenv but several places call getenv directly.

I also did a scan looking for possible overflows and other security problems. Because of my inexperience with the indirection layer I couldn't do an in-depth check, but I did notice that ajStrFromFloat and ajStrFromDouble can fail on Inf, -Inf and NaN, for a couple of reasons:

% cat inf.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void) {
  /* float val = -1.0/0.0; */
  float val = strtod("-inf", NULL);
  char s[100];
  int precision = 0, ival, i;

  sprintf(s, "val == >>%.0f<<", val);
  puts(s);

  /* Converting an infinite float to int is not well defined in C. */
  ival = abs((int) val);
  printf("ival = %d\n", ival);

  /* Same field-width arithmetic as discussed above, on the bad ival. */
  if (ival)
    i = precision + (int) log10((double)ival) + 4;
  else
    i = precision + 4;
  printf("i == %d\n", i);
  return 0;
}
% cc inf.c -lm
% ./a.out
val == >>-inf<<
ival = -2147483648
i == -2147483644
%

** Topic 12

Here's my first pass of the BNF for the ACD file. There are various things to fix, some of which are noted. This can be used for every file in the emboss/acd directory except qatest.acd (which contains a syntax error that acdc doesn't catch -- the "int bint" field) and testplot.acd (contains an '=' instead of a ':', which I don't yet handle).
Lexer: colon = ":" open_block = "\[" close_block = "\]" endsection = "endsection" key = "(?!endsection)[a-zA-Z0-9_]+(?=[\s:\][])" value = "[a-zA-Z0-9_]+(?![\s:\][])[^\s\]]* | [^\000-\037a-zA-Z0-9_:[\]\s][^\s\]]*" quoted = '"[^"]*"' (only handles double quotes - need to fix) comment = "[#][^\n\r]*(\r|\r?\n)" SKIPPED whitespace = "\s+" SKIPPED Parser: (need to update the names to match the syntax doc) application ::= widget_list widget_list ::= widget | widget widget_list widget ::= key colon key open_block arglist close_block | key colon key key | key colon key value | endsection colon key arglist ::= arg | arg arglist arg ::= key colon key | key colon value ** Topic 13 One last thing. The parameter information for the different ACD data types is hard coded in ajacd.c. If it was stored in an external data file (in ACD format with well-defined fields :) then my Python code could read that meta-information to build up its tables, rather than me having to code it all by hand. Hope this wasn't too much at once :) Andrew dalke at dalkescientific.com From 962856211 at tay.ac.uk Tue Aug 7 15:17:47 2001 From: 962856211 at tay.ac.uk (962856211 at tay.ac.uk) Date: Tue, 7 Aug 2001 16:17:47 +0100 Subject: free downloads? Message-ID: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk> The list of programs you have at http://www.uk.embnet.org/Software/EMBOSS/Apps/ is it a list of freedownloads? Barry Marshall BSC Hons -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwilliam at hgmp.mrc.ac.uk Tue Aug 7 15:27:39 2001 From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522) Date: Tue, 07 Aug 2001 16:27:39 +0100 Subject: free downloads? References: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk> Message-ID: <3B7008EB.F4437A4F@hgmp.mrc.ac.uk> > 962856211 at tay.ac.uk wrote: > > The list of programs you have at > http://www.uk.embnet.org/Software/EMBOSS/Apps/ > > is it a list of freedownloads? 
> Barry Marshall BSC Hons This is a list of the applications in the EMBOSS package. The package can be downloaded for free (under the GPL licence) See: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html -- Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512 mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/ Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK From dmartin at bioinformatics.msiwtb.dundee.ac.uk Tue Aug 7 16:22:31 2001 From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin) Date: Tue, 7 Aug 2001 17:22:31 +0100 (BST) Subject: free downloads? In-Reply-To: <000a01c11f54$24f2cb50$bcae3cc1@tay.ac.uk> Message-ID: On Tue, 7 Aug 2001 962856211 at tay.ac.uk wrote: > The list of programs you have at > http://www.uk.embnet.org/Software/EMBOSS/Apps/ > > is it a list of freedownloads? EMBOSS is a freely downloadable package licensed under the GPL/LGPL. You will probably want a unix/linux system on which to install it. The admin guide describes in excruciating detail how to do this (look in the documentation section of the web site). If you are in Dundee (at least your email address is) then drop by if you have any questions. ..d ---------------------------------- David Martin PhD Bioinformatics Scientific Officer Wellcome Trust Biocentre, Dundee ---------------------------------- From cbonnard at isrec-sg1.unil.ch Mon Aug 20 09:09:11 2001 From: cbonnard at isrec-sg1.unil.ch (Claude Bonnard) Date: Mon, 20 Aug 2001 11:09:11 +0200 Subject: Database access for EMBOSS Message-ID: <10108201109.ZM13075@isrec-sg1> Hello, It is not very surprising that SRS is the best mode for a fast access to the sequence databases from EMBOSS. As I understood, the URL mode allows the access to a SINGLE sequence and would not support the "USA" standard (wild card query) as SRS mode does. If it is the case, is there a solution when the SRS server is NOT on the same machine, but on a machine which is dedicated to SRS? 
I have in mind a rsh type of request and I would like to know if someone experience this type of problem and could help me in solving that. Thanks a lot Regards Claude -- Claude Bonnard Ph.D. ISREC (Swiss Institute for Experimental Cancer Research) Bioinformatics Group Ch des Boveresses 155 CH-1066 Epalinges Switzerland phone: [41-21]-692-5891/-2236 fax: [41-21]-652-6933 From peter.rice at uk.lionbioscience.com Mon Aug 20 09:20:55 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Mon, 20 Aug 2001 10:20:55 +0100 Subject: Database access for EMBOSS References: <10108201109.ZM13075@isrec-sg1> Message-ID: <3B80D677.99A6DCC8@uk.lionbioscience.com> Claude Bonnard wrote: > It is not very surprising that SRS is the best mode for a fast access > to the sequence databases from EMBOSS. As I understood, the URL mode > allows the access to a SINGLE sequence and would not support the > "USA" standard (wild card query) as SRS mode does. True. We could add an "SRSREMOTE" access mode to extend queries, easy to program but maybe limited practical use. > If it is the case, is there a solution when the SRS server is NOT > on the same machine, but on a machine which is dedicated to SRS? > I have in mind a rsh type of request and I would like to know if > someone experience this type of problem and could help me in > solving that. SRS access mode allows you to define the name of the getz program. How about an alternative name that is a script, and uses rsh to run a remote getz and returns the results? For example, if your script is called 'remotegetz' just add this to the database definition: app: remotegetz (you can use the full path if needed) Note: This was originally added because the Sanger Centre ran 2 versions of SRS (5.1 and 6.0) and I needed to switch between them, but it has other possible uses. 
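A 'remotegetz' wrapper of the kind suggested above might look something like the following Python sketch. It is an assumption-laden illustration rather than a tested recipe: the host name is hypothetical, it assumes rsh access to that host is already set up and that getz is on the remote PATH, and it simply passes through whatever arguments EMBOSS supplies.

```python
import shlex
import subprocess

REMOTE_HOST = "srs-server.example.org"  # hypothetical SRS machine


def build_remote_command(host, getz_args):
    """Return the argv for rsh.

    Each getz argument is quoted so that characters such as '[' and '*'
    in an SRS query reach the remote getz unchanged by the remote shell.
    """
    remote = " ".join(["getz"] + [shlex.quote(arg) for arg in getz_args])
    return ["rsh", host, remote]


def main(argv):
    # rsh relays the remote getz output to our stdout, which is what
    # the calling program reads.
    return subprocess.call(build_remote_command(REMOTE_HOST, argv))
```

Installed as an executable script, the file would end by calling sys.exit(main(sys.argv[1:])) (with sys imported), and the database definition would then carry "app: remotegetz" as described above.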
-- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From dmartin at bioinformatics.msiwtb.dundee.ac.uk Mon Aug 20 09:34:19 2001 From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin) Date: Mon, 20 Aug 2001 10:34:19 +0100 (BST) Subject: Database access for EMBOSS In-Reply-To: <3B80D677.99A6DCC8@uk.lionbioscience.com> Message-ID: On Mon, 20 Aug 2001, Peter Rice wrote: > Claude Bonnard wrote: > > It is not very surprising that SRS is the best mode for a fast access > > to the sequence databases from EMBOSS. As I understood, the URL mode > > allows the access to a SINGLE sequence and would not support the > > "USA" standard (wild card query) as SRS mode does. > > True. We could add an "SRSREMOTE" access mode to extend queries, easy to > program but maybe limited practical use. > > > If it is the case, is there a solution when the SRS server is NOT > > on the same machine, but on a machine which is dedicated to SRS? > > I have in mind a rsh type of request and I would like to know if > > someone experience this type of problem and could help me in > > solving that. > > SRS access mode allows you to define the name of the getz program. > > How about an alternative name that is a script, and uses rsh to run a > remote getz and returns the results? > > For example, if your script is called 'remotegetz' just add this to the > database definition: > > app: remotegetz > > (you can use the full path if needed) > > Note: This was originally added because the Sanger Centre ran 2 versions of > SRS (5.1 and 6.0) and I needed to switch between them, but it has other > possible uses. This would then allow one to add whichever script one wanted as long as it could parse srs style arguements.. It doesn't have to be SRS, just look like it.. The potential is there for wrapping in house rdbms with such a script. 
I'll add some more comments to the admin guide if Peter can send me details of how EMBOSS calls the wgetz program (not being much of an srs hacker myself). ..d ---------------------------------- David Martin PhD Bioinformatics Scientific Officer Wellcome Trust Biocentre, Dundee ---------------------------------- From peter.rice at uk.lionbioscience.com Mon Aug 20 09:50:22 2001 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Mon, 20 Aug 2001 10:50:22 +0100 Subject: Database access for EMBOSS References: Message-ID: <3B80DD5E.F61846E1@uk.lionbioscience.com> David Martin wrote: > > On Mon, 20 Aug 2001, Peter Rice wrote: > > For example, if your script is called 'remotegetz' just add this to the > > database definition: > > > > app: remotegetz > > This would then allow one to add whichever script one wanted as long > as it could parse srs style arguements.. > > It doesn't have to be SRS, just look like it.. The potential is there > for wrapping in house rdbms with such a script. > > I'll add some more comments to the admin guide if Peter can send me > details of how EMBOSS calls the wgetz program (not being much of an srs > hacker myself). This is getz, not wgetz. It supports the full SRS query language because it calls getz (or a user defined script) with an SRS query constructed from the USA. But there is also an access method in general for external applications. You can use this to set up RDBMS calls - which anyway was the original intention. At present it picks up dbname:id or dbname:acc as the rest of the command line, or puts the id/accession into a formatted string (if the application definition includes %s), but can easily be adapted further. 
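The two behaviours described for the external application method can be illustrated with a small Python sketch. This only mimics the description in this message (append dbname:id, or substitute the id where the definition contains %s); it is not EMBOSS code, and the application name 'fetchseq' used in the note below is made up.

```python
def build_app_command(app_definition, dbname, entry):
    """Illustrate how the command line might be assembled.

    If the application definition contains %s, the id/accession is
    substituted there; otherwise dbname:entry is appended as the rest
    of the command line.
    """
    if "%s" in app_definition:
        return app_definition.replace("%s", entry, 1)
    return "%s %s:%s" % (app_definition, dbname, entry)
```

For instance, a definition of "fetchseq --id %s" with entry X15320 would give "fetchseq --id X15320", while a plain "fetchseq" for database mydb would give "fetchseq mydb:X15320". A script behind such a definition could then query an in-house RDBMS and print the entry.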
-- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From cutler at tularik.com Mon Aug 27 18:34:15 2001 From: cutler at tularik.com (Gene Cutler) Date: Mon, 27 Aug 2001 11:34:15 -0700 Subject: drawing trees Message-ID: Hello, all. I have a question about phylogenetic-type trees for sequences. I haven't quite figured out how to do this using emboss/phylip. This is how I have been doing this with gcg: perform multiple sequence alignment (generally hmmalign or clustalw) convert file to msf format if not already (e.g., sreformat from hmmer package) run gcg program distances on the msf file run gcg program growtree on the distances file end up with a postscript file How would I do this with PHYLIP instead? Thanks. From stein at fieldmuseum.org Mon Aug 27 19:20:00 2001 From: stein at fieldmuseum.org (Jennifer Steinbachs) Date: Mon, 27 Aug 2001 14:20:00 -0500 (CDT) Subject: drawing trees In-Reply-To: Message-ID: Use your favourite alignment program... Put your aligned sequences into PHYLIP format Run the appropriate phylip program... distance-based methods: protdist (for proteins) dnadist (for dna) parsimony protpars dnapars likelihood dnaml or dnamlk protml See the phylip website for more info (http://evolution.genetics.washington.edu/phylip.html). If you aren't certain of the differences between the different tree-building algorithms, you should get your hands on Nei and Kumar 2000 Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good reference for phylogenetics is Hillis et al 1996 Molecular Systematics ISBN 0-87893-282-8 Chapter 11. -jennifer On Mon, 27 Aug 2001, Gene Cutler wrote: >Hello, all. I have a question about phylogenetic-type trees for >sequences. I haven't quite figured out how to do this using >emboss/phylip. 
This is how I have been doing this with gcg: > >perform multiple sequence alignment (generally hmmalign or clustalw) >convert file to msf format if not already (e.g., sreformat from hmmer package) >run gcg program distances on the msf file >run gcg program growtree on the distances file >end up with a postscript file > >How would I do this with PHYLIP instead? > >Thanks. > > ----------------------------------- J. Steinbachs, Ph.D. Computational Biologist http://compbiology.org ----------------------------------- From cutler at tularik.com Mon Aug 27 20:19:56 2001 From: cutler at tularik.com (Gene Cutler) Date: Mon, 27 Aug 2001 13:19:56 -0700 Subject: drawing trees In-Reply-To: References: Message-ID: Thanks Jennifer. One more question: >Put your aligned sequences into PHYLIP format I didn't see any information on the phylip webpage about phylip format and/or conversion tools. I can do the conversion myself if I can find documentation on the format. >If you aren't certain of the differences between the different >tree-building algorithms, you should get your hands on Nei and Kumar 2000 >Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good >reference for phylogenetics is Hillis et al 1996 Molecular Systematics >ISBN 0-87893-282-8 Chapter 11. That's useful too. Thanks again. -- -=-=-=-=-=-=-=-=-=-=-=-=-=- Gene Cutler Bioinformatics Scientist cutler at tularik.com - - - - - - - - - - - - - Tularik Inc 2 Corporate Drive South San Francisco, CA 94080, USA http://www.tularik.com -=-=-=-=-=-=-=-=-=-=-=-=-=- From stein at fieldmuseum.org Mon Aug 27 21:09:55 2001 From: stein at fieldmuseum.org (Jennifer Steinbachs) Date: Mon, 27 Aug 2001 16:09:55 -0500 (CDT) Subject: drawing trees In-Reply-To: Message-ID: I like to use Seaview to make the conversion (I don't have the website handy but a google search should produce it quickly). ClustalX (and maybe clustalw) also produce Phylip files. 
I thought the Phylip website had
information on the format, but it's been a while since I've actually
perused the documentation. The phylip docs should definitely have
complete information.

If I recall correctly, it is something like:

#sequences #nucleotide_sites
sequence_name sequence
sequence_name sequence

etc.

There used to be a 10 character limit on sequence_name, but I don't know
if that holds with the latest version - I use PAUP* mostly for my
analyses. Sequence can be non-interleaved or interleaved.

-jennifer

On Mon, 27 Aug 2001, Gene Cutler wrote:

>Thanks Jennifer. One more question:
>
>>Put your aligned sequences into PHYLIP format
>
>I didn't see any information on the phylip webpage about phylip format
>and/or conversion tools. I can do the conversion myself if I can find
>documentation on the format.
>
>>If you aren't certain of the differences between the different
>>tree-building algorithms, you should get your hands on Nei and Kumar 2000
>>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good
>>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
>>ISBN 0-87893-282-8 Chapter 11.
>
>That's useful too. Thanks again.
>
>

--
-----------------------------------
J. Steinbachs, Ph.D.
Computational Biologist
http://compbiology.org
-----------------------------------

From jrvalverde at cnb.uam.es Tue Aug 28 06:05:15 2001
From: jrvalverde at cnb.uam.es (jrvalverde at cnb.uam.es)
Date: Tue, 28 Aug 2001 08:05:15 +0200 (DST)
Subject: drawing trees
In-Reply-To:
Message-ID: <200108280605.f7S65GE1348757@embnet.cnb.uam.es>

Gene Cutler wrote:
> Thanks Jennifer. One more question:
>
> >Put your aligned sequences into PHYLIP format
>
> I didn't see any information on the phylip webpage about phylip format
> and/or conversion tools. I can do the conversion myself if I can find
> documentation on the format.
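
For reference, the layout recalled earlier in this thread (a count line, then one name-plus-sequence row per entry) can be sketched with a few shell commands. This is only an illustration: the sequence names, the data and the file name example.phy are made up, and it assumes the traditional sequential layout in which each name occupies a 10-character field, which is exactly the detail the thread is unsure about for newer PHYLIP versions. The PHYLIP documentation remains the authority.

```shell
# Hypothetical output file name, chosen for this illustration only.
out=example.phy

{
  # Header line: number of sequences, then number of aligned sites.
  printf ' %d %d\n' 3 12
  # One row per sequence: the name padded (or truncated) to 10 characters,
  # immediately followed by the aligned sequence. All rows have equal length.
  printf '%-10.10s%s\n' seq_one   ACGTACGTAC-T
  printf '%-10.10s%s\n' seq_two   ACGTTCGTACGT
  printf '%-10.10s%s\n' seq_three ACG-ACGTACGT
} > "$out"

# Show the result.
cat "$out"
```

Names longer than 10 characters are cut off by the %-10.10s format, so they need to stay unique within their first 10 characters.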

It's in the package documentation, but if you are already using
CLUSTAL, then simply go to "multiple alignments", choose "output
options" and then select "PHYLIP" format.

As for the details, either look at the documentation ("main.doc") or at
EMBnet's Quick Guide to PHYLIP (PDF and HTML versions may still be found at
http://www.es.embnet.org/~pprpc/activs/PHYLIPGuide/PhylipGuide-1.6.html).

j

From frank at bioss.ac.uk Tue Aug 28 07:45:33 2001
From: frank at bioss.ac.uk (Frank Wright)
Date: Tue, 28 Aug 2001 08:45:33 +0100
Subject: drawing trees
References:
Message-ID: <3B8B4C1D.40315E3C@bioss.ac.uk>

Hi Gene,

>I didn't see any information on the phylip webpage about phylip format
>and/or conversion tools. I can do the conversion myself if I can find
>documentation on the format.

PHYLIP FORMAT
-------------

PHYLIP format is discussed in the PHYLIP documentation "main" file:

http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html

See section 6.Overview of the input and output formats/ subsection
1.Input File Format
See the PHYLIP "sequences" documentation for details of how PHYLIP codes
unknowns and gaps:

http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html

READSEQ
-------
The READSEQ program is very useful for converting between alignment
formats (you may have to edit "." to be "-" though as PHYLIP codes gaps
differently).

ftp://ftp.bio.indiana.edu/molbio/readseq/

To use:

   Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a

There is an old version and a Java version. I prefer the old one :-).


Other programs
--------------

Jennifer Steinbachs has suggested some programs for reformatting
alignments. Here's some additional comments:

There are other programs that can reformat alignment files. CLUSTALW
can be used to read in alignments (option 1) and write them out (option
2, suboption 9) without doing an alignment.
I've not used SEQIO but it
looks useful:

http://bioweb.pasteur.fr/docs/seqio/seqio.html

On a PC you could use "export" and "import" facilities in GENEDOC, an
excellent alignment editor.

http://www.psc.edu/biomed/genedoc/


Best Wishes,
Frank
--
Frank Wright
Biomathematics and Statistics Scotland,
SCRI, DUNDEE DD2 5DA, Scotland
frank at bioss.sari.ac.uk

From letondal at pasteur.fr Tue Aug 28 07:51:50 2001
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 28 Aug 2001 09:51:50 +0200
Subject: drawing trees
In-Reply-To: Your message of "Tue, 28 Aug 2001 08:45:33 BST." <3B8B4C1D.40315E3C@bioss.ac.uk>
Message-ID: <200108280751.f7S7poM220418@electre.pasteur.fr>

Frank Wright wrote:
> Hi Gene,
>
> >I didn't see any information on the phylip webpage about phylip format
> >and/or conversion tools. I can do the conversion myself if I can find
> >documentation on the format.

Our Phylip Web server (http://bioweb/seqanal/phylogeny/phylip-uk.html) may help
with format conversion as well as phylogenetic programs chaining.

>
> PHYLIP FORMAT
> -------------
>
> PHYLIP format is discussed in the PHYLIP documentation "main" file:
>
> http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-main.html
>
> See section 6.Overview of the input and output formats/ subsection
> 1.Input File Format
> See the PHYLIP "sequences" documentation for details of how PHYLIP codes
> unknowns and gaps:
>
> http://www.hgmp.mrc.ac.uk/Menu/Help/phylip-sequence.html
>
> READSEQ
> -------
> The READSEQ program is very useful for converting between alignment
> formats (you may have to edit "." to be "-" though as PHYLIP codes gaps
> differently).
>
> ftp://ftp.bio.indiana.edu/molbio/readseq/
>
> To use:
>
> Unix% readseq mygene.msf -format=phylip -output=mygene.phy -a
>
> There is an old version and a Java version. I prefer the old one :-).
>
>
> Other programs
> --------------
>
> Jennifer Steinbachs has suggested some programs for reformatting
> alignments.
> Here's some additional comments:
>
> There are other programs that can reformat alignment files. CLUSTALW
> can be used to read in alignments (option 1) and write them out (option
> 2, suboption 9) without doing an alignment. I've not used SEQIO but it
> looks useful:
>
> http://bioweb.pasteur.fr/docs/seqio/seqio.html
>
> On a PC you could use "export" and "import" facilities in GENEDOC, an
> excellent alignment editor.
>
> http://www.psc.edu/biomed/genedoc/
>
>
>
> Best Wishes,
> Frank
> --
> Frank Wright
> Biomathematics and Statistics Scotland,
> SCRI, DUNDEE DD2 5DA, Scotland
> frank at bioss.sari.ac.uk

--
Catherine Letondal -- Pasteur Institute Computing Center

From frank at bioss.ac.uk Tue Aug 28 08:04:13 2001
From: frank at bioss.ac.uk (Frank Wright)
Date: Tue, 28 Aug 2001 09:04:13 +0100
Subject: drawing trees
References:
Message-ID: <3B8B507D.B7639FE5@bioss.ac.uk>

Hi All,

Gene Cutler asked:

>Hello, all. I have a question about phylogenetic-type trees for
>sequences. I haven't quite figured out how to do this using
>emboss/phylip. This is how I have been doing this with gcg:
>
>run gcg program distances on the msf file
>run gcg program growtree on the distances file
>
>How would I do this with PHYLIP instead?

The GCG DISTANCES program and GCG GROWTREE programs are very similar to
the DNADIST/PROTDIST and Neighbor programs in PHYLIP. In other words,
they allow phylogenetic trees to be constructed using "distance-based"
methods, but do not allow maximum likelihood or parsimony methods to be
used. They also don't do bootstrapping tests, tree comparisons, and lots
of other things.

If you are using distance-based phylogenetic methods, some notes:

(1) Weighted least-squares is slower but more accurate than
Neighbor-Joining, so use the PHYLIP FITCH program instead of NEIGHBOR.

(2) Recently, the PHYLIP-like WEIGHBOR program (Weighted
Neighbor-Joining) has been released. WEIGHBOR appears to be an
improvement on Neighbor-Joining (and possibly weighted least squares).
See http://www.t10.lanl.gov/billb/weighbor/. I've not tried it out much
but the simulations in the published paper look convincing.

(3) PHYLIP (version 3.6) has improved DNADIST (more distance methods)
and PROTDIST (rate heterogeneity among sites added).

Best Wishes,
Frank
--
Frank Wright
Biomathematics and Statistics Scotland,
SCRI, DUNDEE DD2 5DA, Scotland
frank at bioss.sari.ac.uk

From dmartin at bioinformatics.msiwtb.dundee.ac.uk Tue Aug 28 08:26:00 2001
From: dmartin at bioinformatics.msiwtb.dundee.ac.uk (David Martin)
Date: Tue, 28 Aug 2001 09:26:00 +0100 (BST)
Subject: drawing trees
In-Reply-To:
Message-ID:

Remember that with the Phylip programs distributed as an EMBASSY package,
EMBOSS will do the sequence conversions for you. The programs to run have
the same names with an e prepended. The package also has the various
options in ACD format, so the programs can be fully scripted.

The two programs that haven't been EMBOSSised are DRAWTREE and DRAWGRAM.

..d

On Mon, 27 Aug 2001, Jennifer Steinbachs wrote:

>
> I like to use Seaview to make the conversion (I don't have the website
> handy but a google search should produce it quickly). ClustalX (and maybe
> clustalw) also produce Phylip files. I thought the Phylip website had
> information on the format, but it's been a while since I've actually
> perused the documentation. The phylip docs should definitely have
> complete information.
>
> If I recall correctly, it is something like:
>
> #sequences #nucleotide_sites
> sequence_name sequence
> sequence_name sequence
>
> etc.
>
> There used to be a 10 character limit on sequence_name, but I don't know
> if that holds with the latest version - I use PAUP* mostly for my
> analyses. Sequence can be non-interleaved or interleaved.
>
> -jennifer
>
> On Mon, 27 Aug 2001, Gene Cutler wrote:
>
> >Thanks Jennifer.
> >One more question:
> >
> >>Put your aligned sequences into PHYLIP format
> >
> >I didn't see any information on the phylip webpage about phylip format
> >and/or conversion tools. I can do the conversion myself if I can find
> >documentation on the format.
> >
> >>If you aren't certain of the differences between the different
> >>tree-building algorithms, you should get your hands on Nei and Kumar 2000
> >>Molecular Evolution and Phylogenetics ISBN 0-19-513585-7. The other good
> >>reference for phylogenetics is Hillis et al 1996 Molecular Systematics
> >>ISBN 0-87893-282-8 Chapter 11.
> >
> >That's useful too. Thanks again.
> >
> >

----------------------------------
David Martin PhD
Bioinformatics Scientific Officer
Wellcome Trust Biocentre, Dundee
----------------------------------

From pscotney at hotmail.com Sat Aug 25 13:55:16 2001
From: pscotney at hotmail.com (Pierre Scotney)
Date: Sat, 25 Aug 2001 23:55:16 +1000
Subject: [EMBOSS] EMBOSS and Jemboss installation problems SOLVED!
Message-ID:

Hello!

I have solved the GNU/Linux EMBOSS and Jemboss installation problems :)

The solution was:

1) edit both /etc/profile (for bash) and /etc/csh.login (for csh) so that
$PATH includes /usr/local/lib/j2sdk1.4.2/bin; previously only bash had the
correct path to the Java binaries.

2) use j2sdk1.4.2, as EMBOSS-2.8.0 will not build with j2sdk1.3.1
(Blackdown Java-Linux).

Maybe the documentation/scripts will need to be changed to reflect this
issue.

Cheers

Pierre

--
Dr Pierre Scotney
Melbourne
Australia
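
The PATH change in step 1 of the message above might look roughly like the following. This is a sketch, not the poster's actual edit: the directory is the one named in the message, while the JAVA_BIN variable and the check that avoids adding the directory twice are additions made for illustration.

```shell
# Directory named in the message above; adjust it to the local Java install.
JAVA_BIN=/usr/local/lib/j2sdk1.4.2/bin

# Bourne-style shells (e.g. lines for /etc/profile): append the directory
# only if it is not already on PATH.
case ":$PATH:" in
  *":$JAVA_BIN:"*) ;;                        # already present, nothing to do
  *) PATH="$PATH:$JAVA_BIN"; export PATH ;;  # append and export
esac

# csh-style shells (e.g. a line for /etc/csh.login) use setenv instead:
#   setenv PATH "${PATH}:/usr/local/lib/j2sdk1.4.2/bin"
```

After logging in again, running "which java" in both bash and csh is a quick way to confirm that each shell finds the same Java binaries.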