From fornai at biomed.unipi.it Thu Oct 3 06:27:40 2002
From: fornai at biomed.unipi.it (Claudia Fornai)
Date: Thu, 3 Oct 2002 12:27:40 +0200
Subject: pepwindawall
Message-ID: <000301c26ac7$b5b26a00$060e7283@ttvgroup>

dear emboss
I'm Claudia Fornai, and I'm writing from Italy. I'd like instructions for using pepwindowall and other programs from a suitable UNIX platform.
Best regards,
Claudia Fornai

fornai at biomed.unipi.it

From letondal at pasteur.fr Fri Oct 4 02:05:39 2002
From: letondal at pasteur.fr (Catherine Letondal)
Date: Fri, 04 Oct 2002 08:05:39 +0200
Subject: pepwindawall
In-Reply-To: Your message of "Thu, 03 Oct 2002 12:27:40 +0200." <000301c26ac7$b5b26a00$060e7283@ttvgroup>
Message-ID: <200210040605.g9465duY106667@electre.pasteur.fr>

"Claudia Fornai" wrote:
>
> dear emboss
> I'm Claudia Fornai, and I'm writing from Italy. I'd like instructions for
> using pepwindowall and other programs from a suitable UNIX platform.
> Best regards,
> Claudia Fornai
>
> fornai at biomed.unipi.it
>

Hi Claudia,

I guess that the documentation contains many answers to your question, but if you use the Web interface provided here:
http://bioweb.pasteur.fr/seqanal/interfaces/pepwindowall.html
you will have the Unix command corresponding to your parameters displayed in the results page.

Other EMBOSS programs are available from here:
http://bioweb.pasteur.fr/intro-uk.html
(where there are not only EMBOSS programs, though)

--
Catherine Letondal -- Pasteur Institute Computing Center

From squiresb at macrogenics.com Fri Oct 4 13:48:47 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Fri, 04 Oct 2002 12:48:47 -0500
Subject: Primer prediction problems...
Message-ID:

Hello all,

I am trying to use EMBOSS to predict PCR primers.
I have tried downloading the Catapult installers for Mac OS X as well as downloading the V2.5.1 tar file and the primer3.0.9 tar and installing them. I get errors about a broken pipe or no primer3_core file found? Can I trouble someone to point out an install document on a website that lists a current set of instructions on installing EMBOSS and primer3 (or another primer prediction program)? Thanks in advance! Burke Squires From gwilliam at hgmp.mrc.ac.uk Mon Oct 7 04:17:32 2002 From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522) Date: Mon, 07 Oct 2002 09:17:32 +0100 Subject: Primer prediction problems... References: Message-ID: <3DA1431C.A139446B@hgmp.mrc.ac.uk> The primer3_core program needs to be on your path before you can run eprimer3. Gary Burke Squires wrote: > > Hello all, > > I am trying to use EMBOSS to predict PCR primers. I have tried downloading > the Catapult installers for Mac OS X as well as downloading the V2.5.1 tar > file and the primer3.0.9 tar and installing them. I get errors about a > broken pipe or no primer3_core file found? > > Can I trouble someone to point out an install document on a website that > lists a current set of instructions on installing EMBOSS and primer3 (or > another primer prediction program)? > > Thanks in advance! > > Burke Squires -- Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512 mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/ Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK From avc at sanger.ac.uk Mon Oct 7 06:50:40 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Mon, 07 Oct 2002 11:50:40 +0100 Subject: fasta splitter Message-ID: <3DA16700.90280280@sanger.ac.uk> Is there an emboss app to split a large fasta file into a set of smaller ones? I'm combing the docs but can't see anything - it may be staring me in the face... 
thanks Tony -- ############################################################## Email: avc at sanger.ac.uk # Webmaster,The Sanger Centre, Tel: 01223 497512 # Hinxton, CAMBRIDGE CB10 1SA. Fax: 01223 494919 # http://www.sanger.ac.uk/ ############################################################## From Thomas.Laurent at uk.lionbioscience.com Mon Oct 7 07:02:02 2002 From: Thomas.Laurent at uk.lionbioscience.com (Thomas Laurent) Date: Mon, 07 Oct 2002 12:02:02 +0100 Subject: fasta splitter References: <3DA16700.90280280@sanger.ac.uk> Message-ID: <3DA169AA.1040409@uk.lionbioscience.com> Hi tony, I think Splitter should do the job : http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html Cheers, Thomas Tony Cox wrote: > Is there an emboss app to split a large fasta file into a set of smaller ones? > I'm combing the docs but can't see anything - it may be staring me in the > face... > > thanks > > Tony > From avc at sanger.ac.uk Mon Oct 7 07:16:47 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Mon, 7 Oct 2002 12:16:47 +0100 (BST) Subject: fasta splitter In-Reply-To: <3DA169AA.1040409@uk.lionbioscience.com> Message-ID: On Mon, 7 Oct 2002, Thomas Laurent wrote: +>Hi tony, +>I think Splitter should do the job : +>http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html almost, but not quite. This converts one file to many files containg one sequence. I need something like a conversion of one file containing 1000 seqs to 10 files each containing 100 seqs Tony +> +>Cheers, +>Thomas +> +>Tony Cox wrote: +>> Is there an emboss app to split a large fasta file into a set of smaller ones? +>> I'm combing the docs but can't see anything - it may be staring me in the +>> face... +>> +>> thanks +>> +>> Tony +>> +> +> ****************************************************** Tony Cox Email:avc at sanger.ac.uk Sanger Institute WWW:www.sanger.ac.uk Wellcome Trust Genome Campus Webmaster Hinxton Tel: +44 1223 834244 Cambs. 
CB10 1SA Fax: +44 1223 494919 ****************************************************** From jweiner1 at ix.urz.uni-heidelberg.de Mon Oct 7 07:47:33 2002 From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3) Date: Mon, 7 Oct 2002 13:47:33 +0200 (METDST) Subject: fasta splitter In-Reply-To: Message-ID: Hello, > almost, but not quite. This converts one file to many files containg one > sequence. I need something like a conversion of one file containing 1000 > seqs to 10 files each containing 100 seqs I wrote you a simple perl script which should do the job. Save it to a file and make it executable (I think you are using a Unix-based system, aren't you?) with chmod a+x split.pl. To be on the safe side, put it in a new directory, and copy your sequence file to the same directory. Now run ./split.pl ...where filename is the name of the file containing your 1000+ sequences, and is the number of sequences you wish to have in each produced file. The produced file will have the same name as the original file with the appendix .1, .2, .3 etc. I tried the script and it seems to work fine. Meet the power of Perl :-) Regards, j. ----)-\//-///-----------------------------------January-Weiner-3------- Technologists often forget the general user. Technology is only as good as the user experience. That is something that technology groups very often forget... [ Linus Torvalds, taken from the GNOME Usability Project ] -------------- next part -------------- A non-text attachment was scrubbed... Name: split.pl Type: application/x-perl Size: 849 bytes Desc: Url : http://lists.open-bio.org/pipermail/emboss/attachments/20021007/304bd4fb/attachment.bin From areagp61 at yahoo.it Mon Oct 7 08:49:35 2002 From: areagp61 at yahoo.it (Graziano P.) 
Date: Mon, 7 Oct 2002 14:49:35 +0200 Subject: Codon usage files Message-ID: <000b01c26e00$03ee27f0$18105709@italy.ibm.com> Hi all, with backtranseq I can use different codon usage table selecting different "codon usage files" in the EMBOSS data path. Some files are self-explanating (for example Ehuman.cut is the codon usage file name for Homo sapiens), but other files are not so self-explanating like Eacc.cut, Esma.cut, Eddi.cut, etc. Is there any document that report informations about every file? Thanks Graziano Pappad? ______________________________________________________________________ Mio Yahoo!: personalizza Yahoo! come piace a te http://it.yahoo.com/mail_it/foot/?http://it.my.yahoo.com/ From md0nilhe at mdstud.chalmers.se Mon Oct 7 09:10:33 2002 From: md0nilhe at mdstud.chalmers.se (Henrik Nilsson) Date: Mon, 7 Oct 2002 15:10:33 +0200 (MET DST) Subject: EMBASSY problem Message-ID: Hello I'm having major problems with compiling the PHYLIP package of EMBASSY. Would anyone happen to have compiled it successfully on RedHat 7.3, and would be willing to send me the executables? hENRiK -- Written using VIM - Vi IMproved version 5.0 http://www.vim.org From ableasby at hgmp.mrc.ac.uk Mon Oct 7 09:14:43 2002 From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk) Date: Mon, 7 Oct 2002 14:14:43 +0100 (BST) Subject: Codon usage files Message-ID: <200210071314.OAA29103@bromine.hgmp.mrc.ac.uk> Not every file but most are described in the README file from ftp://ftp.ebi.ac.uk/pub/databases/codonusage You can use the EMBOSS program 'cutgextract' on the CUTG database to get files with more meaningful (long) names. Alan From mathog at mendel.bio.caltech.edu Mon Oct 7 10:49:05 2002 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Mon, 07 Oct 2002 07:49:05 -0700 Subject: fasta splitter Message-ID: > > Is there an emboss app to split a large fasta file into a set of smaller ones? 
> I'm combing the docs but can't see anything - it may be staring me in the
> face...

This isn't an EMBOSS entry, but it will probably do what you want:

ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

There are some other fasta related utilities in the same directory.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From tmargus at ebc.ee Mon Oct 7 13:54:12 2002
From: tmargus at ebc.ee (Tõnu Margus)
Date: Mon, 7 Oct 2002 20:54:12 +0300
Subject: WWW - Emma is not able to create SOME temporary files
Message-ID: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>

Hi,

I am using EMBOSS via Luke McCarthy's web interface. All other programs are working, but emma did not work correctly.

It gives an error:

Error: failed to open filename 8808B
Problem writing out EMBOSS alignment file
Error: failed to open filename 8808B
Problem writing out EMBOSS alignment file

It seems that for some reason it cannot create a file under the runs/temp directory. Why not is unclear to me. All other files are there.

Files under the directory runs/fileVxWbES$/:

root at kobra:fileVxWbES$ ls -l
total 16
-rw-r--r-- 1 www java  915 Oct 7 20:51 8825A
-rw-r--r-- 1 www java    0 Oct 7 20:51 dendoutfile
-rw-r--r-- 1 www java  384 Oct 7 20:51 error
-rw-r--r-- 1 www java 2145 Oct 7 20:51 index.html
drwxr-xr-x 2 www java 4096 Oct 7 20:51 input
-rw-r--r-- 1 www java    0 Oct 7 20:51 outseq

Command line clustalw works ok.

Is there a solution for this problem?

Tõnu Margus

From starksb at ebi.ac.uk Mon Oct 7 14:58:15 2002
From: starksb at ebi.ac.uk (David Starks-Browning)
Date: Mon, 7 Oct 2002 19:58:15 +0100
Subject: WWW - Emma is not able to create SOME temporary files
In-Reply-To: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
References: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
Message-ID: <5473-Mon07Oct2002195815+0100-starksb@ebi.ac.uk>

On Monday 7 Oct 02, Tõnu Margus writes:
> Hi,
>
> I am using EMBOSS via Luke McCarthy's web interface. All other programs are working,
> but emma did not work correctly.
>
> It gives an error:
>
> Error: failed to open filename 8808B
> Problem writing out EMBOSS alignment file
> Error: failed to open filename 8808B
> Problem writing out EMBOSS alignment file
>
> It seems that for some reason it cannot create a file under the runs/temp directory.
> Why not is unclear to me. All other files are there.
>
> Files under the directory runs/fileVxWbES$/:
>
> root at kobra:fileVxWbES$ ls -l
> total 16
> -rw-r--r-- 1 www java  915 Oct 7 20:51 8825A
> -rw-r--r-- 1 www java    0 Oct 7 20:51 dendoutfile
> -rw-r--r-- 1 www java  384 Oct 7 20:51 error
> -rw-r--r-- 1 www java 2145 Oct 7 20:51 index.html
> drwxr-xr-x 2 www java 4096 Oct 7 20:51 input
> -rw-r--r-- 1 www java    0 Oct 7 20:51 outseq
>
> Command line clustalw works ok.

You don't show the permissions of the directory itself (use 'ls -la'). It's the directory permissions that determine whether files can be created.

However, this may not be the problem. We have seen problems with emma on Linux, because the underlying application, clustalw, cannot deal with filenames that are 5 characters long on Linux. String buffer management bugs in emma cause it to emit garbage characters after the filename to the open() system call. With emma, you will see this when emma's PID is 4 digits long. (You won't see the garbage characters in error messages. You only see them under strace.)
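
To make the length condition above concrete, here is a small Python sketch. It assumes, judging only from the names 8808B and 8825A in the report, that emma builds its temporary names from the decimal process ID plus a single letter. That naming rule and both function names are assumptions made for illustration; they are not taken from emma.c.

```python
def temp_name(pid, suffix):
    """Assumed emma-style temporary name: decimal PID plus one letter.

    This naming rule is a guess based on the names in the report above.
    """
    return "%d%s" % (pid, suffix)


def is_affected(pid, suffix="A", bad_length=5):
    """Return True if the assumed temporary name has the problematic length."""
    return len(temp_name(pid, suffix)) == bad_length


if __name__ == "__main__":
    for pid in (999, 8808, 12345):
        print(pid, temp_name(pid, "B"), is_affected(pid, "B"))
```

Under that assumption only four-digit PIDs give a five-character name, which would explain why the failure seems to come and go between runs.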
Clustalw should be fixed. If that won't happen, emma.c could be modified to pad the temporary file name with enough extra characters so that, regardless of Linux PID, emma will use temp filenames longer than 5 characters. I don't have a patch for the latest version of emma, because I applied the workaround to an old (1.9.1) version of EMBOSS. Emma.c has changed a bit since then, although the change is still straightforward to apply. If you think this is your problem, I can provide details on how to modify emma.c. Hope this helps. Kind regards, David ------------------------------------------------------------------- David Starks-Browning | starksb at ebi.ac.uk EMBL Outstation -- | The European Bioinformatics Institute | Wellcome Trust Genome Campus | tel: +44 (1223) 494 616 Hinxton, Cambridge, CB10 1SD, UK | fax: +44 (1223) 494 468 ------------------------------------------------------------------- From tcarver at hgmp.mrc.ac.uk Tue Oct 8 04:44:34 2002 From: tcarver at hgmp.mrc.ac.uk (Tim Carver) Date: Tue, 08 Oct 2002 09:44:34 +0100 Subject: Jemboss Server Feedback Message-ID: <3DA29AF2.BE48E5F5@hgmp.mrc.ac.uk> It would be immensely useful if those who have setup a Jemboss server could provide some feedback to us. This is useful in providing some ideas for the future direction of its development and to give our funding body some idea of its usage at other sites. In particular the following information would be of use: 1. Nationality 2. Funding body and/or Organisation 3. Server Platform O/S (linux, solaris, MacOSX, AIX, HP-UX....) 4. Type of installation - e.g. with unix authorisation 5. Number of users at your site using Jemboss 6. 
Comments
   - what were you using before & why you changed
   - likes, dislikes & suggestions for Jemboss development (server & client)

Many thanks in advance,
Tim Carver
HGMP-RC

From mq1 at sanger.ac.uk Tue Oct 8 09:00:06 2002
From: mq1 at sanger.ac.uk (Mike Quail)
Date: Tue, 8 Oct 2002 14:00:06 +0100
Subject: restriction mapping
Message-ID: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk>

Hi

I am currently looking to isolate restriction fragments that cover gaps that are left in several genomes. To do this I need to cut the sequence we have of a genome with all known database enzymes and then select those that just cut a few times and in the right place so as to excise the region of the genome I require.

GCG programs map and mapplot were excellent for doing this. Map in particular is good as it gives a graphical plot for each enzyme (one enzyme per line), plotting all the enzymes on a page or two so you can rapidly see which is appropriate.

I have tried the EMBOSS programs and basically they are no use. REMAP does what I want but in too great detail (the output would stretch round the globe) and RESTRICT is too unordered in its output.

I have got a program called oligo on my PC that will do this, BUT it has problems with big sequences. Recently I tried analysing a 1.5Mb chromosome and it would only work if I limited the number of enzymes to 6 or less. So I could transfer the data over to my PC and try with that, but as this organism is 5Mb it will be very slow going.

Have you any ideas of how this could be done in EMBOSS?

M.Quail
Project Leader
Wellcome Trust Sanger Institute

From peter.rice at uk.lionbioscience.com Tue Oct 8 09:30:31 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 14:30:31 +0100
Subject: restriction mapping
References: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk>
Message-ID: <3DA2DDF7.70604@uk.lionbioscience.com>

Mike Quail wrote:
> I am currently looking to isolate restriction fragments that cover gaps
> that are left in several genomes. To do this I need to cut the sequence
> we have of a genome with all known database enzymes and then select
> those that just cut a few times and in the right place so as to excise
> the region of the genome I require.
>
> Have you any ideas of how this could be done in EMBOSS.

You just need to know the enzymes that only cut twice, for example?

% restrict -min 2 -max 2 -plasmid

(the -plasmid may look odd, but it means "circular DNA" and says nothing about the size :-)

You can also check each enzyme one at a time afterwards:

% restrict -plasmid -fragment -enzyme BssHI

... the -fragment option includes the fragment sizes at the end of the report. You will need the positions and the fragment sizes to choose an enzyme.

You can select other report formats (-rformat), but the default is probably the most useful for your case (-rformat EMBL or GFF, for example, will miss the -fragment output).

Meanwhile, a graphical view could be nice so you can look for restriction sites on screen. We can look into that.

Hope this helps,

Peter Rice

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From jonas.andersson at rocketmail.com Tue Oct 8 10:35:50 2002
From: jonas.andersson at rocketmail.com (Jonas Andersson)
Date: Tue, 8 Oct 2002 07:35:50 -0700 (PDT)
Subject: Not compiling?
Message-ID: <20021008143550.40367.qmail@web40110.mail.yahoo.com> When I try to compile the latest EMBOSS this is what I get. What do I do wrong, given that I do as is suggested on the EMBOSS pages? -MT ajreport.lo -MD -MP -MF .deps/ajreport.TPlo -o ajreport.o >/dev/null 2>&1 make[1]: *** [ajreport.lo] Error 1 make[1]: Leaving directory `/home/henrik/temp/emboss/EMBOSS-2.5.1/ajax' make: *** [all-recursive] Error 1 / Jonas __________________________________________________ Do you Yahoo!? Faith Hill - Exclusive Performances, Videos & More http://faith.yahoo.com From avc at sanger.ac.uk Tue Oct 8 11:10:22 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Tue, 8 Oct 2002 16:10:22 +0100 (BST) Subject: fasta splitter In-Reply-To: Message-ID: On Tue, 8 Oct 2002, January Weiner 3 wrote: Thanks to all that responded. I did, in the end write a 12 line bioperl script to split my fasta file. My request seems, however, to highlight a small blind spot on the EMBOSS radar. It appears that there are a number of implementations out there - perhaps one of them can be donated to the emboss project as the basis of a new software tool? Tony +>Hi, +> +>> This is apparently something that is frequently asked by biologists. +>> If you call it fastasplitter, I have a Web interface ready for it: +>> http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html +>> If you think it's interesting, I install it, and in such case, I will +>> put your name (J. Weiner ?) on the Web interface. +> +>No problem, do it, it's freeware (not even GPL :-). However, if you think +>that such a tool is useful, then I'll rewrite it in C -- to make it faster. +>If I may suggest -- it'd be nice if you could download or get the produced +>files as a tgz or zip archive. +> +>j. +> +>----)-\//-///-----------------------------------January-Weiner-3------- +>Wysz?a Ho?? i Czyst?, wr?ci?a Wsp?ln? i Nieca?? 
[ (C) by moja babcia ] +> +> ****************************************************** Tony Cox Email:avc at sanger.ac.uk Sanger Institute WWW:www.sanger.ac.uk Wellcome Trust Genome Campus Webmaster Hinxton Tel: +44 1223 834244 Cambs. CB10 1SA Fax: +44 1223 494919 ****************************************************** From Joerg.Schaber at uv.es Tue Oct 8 11:58:11 2002 From: Joerg.Schaber at uv.es (Joerg Schaber) Date: Tue, 08 Oct 2002 17:58:11 +0200 Subject: loading DDBJ data into EMBOSS Message-ID: <3DA30093.6080404@uv.es> Hi, i have problems creating an EMBOSS database from a DDBJ flatfile (e.g. ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using 'dbiflat -idformat gb'. I get a warning for all entries in the flatfile 'Warning: Duplicate ID skipped: '' All hits will point to first ID found? and I can not retrieve any sequence. I think dbiflat only recognizes the first entry. When I download the corresponding fasta flatfile I have no problems creating an EMBOSS database using 'dbifasta'. However, I would like to use the original DDBJ flatfile because it includes more information. Any idea what's the problem? greetings, joerg From peter.rice at uk.lionbioscience.com Tue Oct 8 12:08:47 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Tue, 08 Oct 2002 17:08:47 +0100 Subject: loading DDBJ data into EMBOSS References: <3DA30093.6080404@uv.es> Message-ID: <3DA3030F.2030808@uk.lionbioscience.com> Joerg Schaber wrote: > Hi, > > i have problems creating an EMBOSS database from a DDBJ flatfile (e.g. > ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using > 'dbiflat -idformat gb'. I get a warning for all entries in the flatfile > 'Warning: Duplicate ID skipped: '' All hits will point to first ID > found? and I can not retrieve any sequence. I think dbiflat only > recognizes the first entry. > When I download the corresponding fasta flatfile I have no problems > creating an EMBOSS database using 'dbifasta'. 
> However, I would like to use the original DDBJ flatfile because it includes
> more information.
> Any idea what's the problem?

Yes ... that file is not in Genbank or DDBJ format!!!!

It looks more like a CODATA format, but only the ENTRY is recognized.

If you can find a name for it, we could probably implement a new input/output sequence format ... but it has some horrible features that will not be general.

Example entry:

ENTRY       BU002  CDS  Buchnera
NAME        atpB
DEFINITION  ATP synthase A chain [EC:3.6.3.14] [SP:ATP6_BUCAI]
CLASS       Metabolism; Energy Metabolism; Oxidative phosphorylation [PATH:buc00190]
            Metabolism; Energy Metabolism; ATP synthesis [PATH:buc00193]
            Metabolism; Energy Metabolism; Photosynthesis [PATH:buc00195]
POSITION    2278..3102
DBLINKS     RIKEN: BU002
            NCBI: 10038695
CODON_USAGE       T             C             A             G
          T  27  2 22  7   11  0  7  1    7  1  1  0    1  0  0  5
          C   4  0  3  2    6  1  4  2    5  1  8  2    1  0  2  0
          A  28  0  5 12    5  0  3  0    7  3 13  1    4  1  0  0
          G   4  1 12  3    5  1  5  0    8  0  7  1    7  2  4  0
AASEQ       274
            MILEKISDPQKYISHHLSHLQIDLRSFKIIQPGALSSDYWTVNVDSMFFSLVLGSFFLSI
            FYMVGKKITQGIPGKLQTAIELIFEFVNLNVKSMYQGKNALIAPLSLTVFIWVFLMNLMD
            LVPIDFFPFISEKVFELPAMRIVPSADINITLSMSLGVFFLILFYTVKIKGYVGFLKELI
            LQPFNHPVFSIFNFILEFVSLVSKPISLGLRLFGNMYAGEMIFILIAGLLPWWTQCFLNV
            PWAIFHILIISLQAFIFMVLTIVYLSMASQSHKD
NTSEQ       825
            atgattttagaaaagatatctgatcctcaaaaatatattagtcatcatttaagtcacttg
            cagatagatttgcgttcttttaaaattattcaaccaggtgcattgtcttctgattattgg
            actgtaaatgttgattcaatgtttttttctcttgtactgggtagtttttttttaagtatt
            ttttatatggtaggaaaaaaaattactcaaggtataccaggtaaattacaaactgcaatt
            gagttaatttttgaatttgtaaatttaaatgtaaaaagcatgtatcaaggtaaaaatgct
            cttattgcacctttatcattaacagtatttatttgggtttttttaatgaatctaatggat
            ttagttccgattgatttctttccatttatttctgaaaaagtgtttgaattacctgctatg
            cgaattgtaccttctgctgatattaatattacactatcaatgtcacttggcgtgtttttt
            ttaattttattttatactgttaaaattaaaggatatgtaggctttttaaaagaacttatt
            ttacaacctttcaaccatcctgtattttctatttttaattttatattagaatttgtgtca
            ttggtctcgaaacccatttctttgggattgcgattatttggaaacatgtacgcaggtgaa
            atgatttttattttaattgcaggtttgctgccatggtggacacaatgttttttaaacgta
            ccgtgggctatttttcatattttaataatttcactacaggcttttatttttatggtatta
            actattgtatatttatcaatggcctctcaatctcataaagattaa
///

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From peter.rice at uk.lionbioscience.com Tue Oct 8 12:37:36 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:37:36 +0100
Subject: fasta splitter
References:
Message-ID: <3DA309D0.7@uk.lionbioscience.com>

Tony Cox wrote:
> On Tue, 8 Oct 2002, January Weiner 3 wrote:
>
> Thanks to all that responded. I did, in the end write a 12 line bioperl script
> to split my fasta file. My request seems, however, to highlight a small blind
> spot on the EMBOSS radar. It appears that there are a number of implementations
> out there - perhaps one of them can be donated to the emboss project as the
> basis of a new software tool?

Nobody suggested hacking "seqret" to do what you want...

One problem doing this in EMBOSS is the need to generate filenames for your split files - but maybe a base filename would be enough to generate names. Then all you need to do is count sequences in a modified seqret.c and change the output file. You can add a command line option for the number of sequences in an output file. Cleaning up output files for a rerun is an exercise for the user (unless you want to invent a new ACD type that does it :-)

Needs a modified version of the seqFileReopen function to handle the file naming, but nothing complicated is involved.

regards.
Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From avc at sanger.ac.uk Tue Oct 8 12:39:31 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Tue, 8 Oct 2002 17:39:31 +0100 (BST) Subject: fasta splitter In-Reply-To: <3DA309D0.7@uk.lionbioscience.com> Message-ID: On Tue, 8 Oct 2002, Peter Rice wrote: that sounds excellent - does this mean it really will make it in to the EMBOSS release? (any idea when? ;) Tony +>Tony Cox wrote: +>> On Tue, 8 Oct 2002, January Weiner 3 wrote: +>> +>> Thanks to all that responded. I did, in the end write a 12 line bioperl script +>> to split my fasta file. My request seems, however, to highlight a small blind +>> spot on the EMBOSS radar. It appears that there are a number of implementations +>> out there - perhaps one of them can be donated to the emboss project as the +>> basis of a new software tool? +> +>Nobody suggested hacking "seqret" to do what you want... +> +>One problem doing this in EMBOSS is the need to generate filenames for your +> split files - but maybe a base filename would be enough to generate +>names. Then all you need to do is count sequences in a modified seqret.c +>and change the output file. You can add a command line option for the +>number of sequences in an output file. Cleaning up output files for a rerun +>is an exercise for the user (unless you want to invent a new ACD type that +>does it :-) +> +>Needs a modified version of the seqFileReopen function to handle the file +>naming, but nothing complicated is involved. +> +>regards. 
+> +>Peter +> +>-- +>------------------------------------------------ +>Peter Rice, LION Bioscience Ltd, Cambridge, UK +>peter.rice at uk.lionbioscience.com +44 1223 224723 +> ****************************************************** Tony Cox Email:avc at sanger.ac.uk Sanger Institute WWW:www.sanger.ac.uk Wellcome Trust Genome Campus Webmaster Hinxton Tel: +44 1223 834244 Cambs. CB10 1SA Fax: +44 1223 494919 ****************************************************** From peter.rice at uk.lionbioscience.com Tue Oct 8 12:42:39 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Tue, 08 Oct 2002 17:42:39 +0100 Subject: fasta splitter References: Message-ID: <3DA30AFF.3010100@uk.lionbioscience.com> Hi Tony > that sounds excellent - does this mean it really will make it in to the EMBOSS > release? (any idea when? ;) I already have the first part of the code ... a modified "seqret" to split into 10 sequences per file. Working copy is called "tenco" :-) What did you have in mind as a naming convention for the output files? The existing code names each file after the first sequence, I guess you want "outfile.1" "outfile.2" and so on, possibly with leading zeroes "outfile.,001" etc. Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From letondal at pasteur.fr Tue Oct 8 14:11:11 2002 From: letondal at pasteur.fr (Catherine Letondal) Date: Tue, 08 Oct 2002 20:11:11 +0200 Subject: fasta splitter In-Reply-To: Your message of "Mon, 07 Oct 2002 13:47:33 +0200." Message-ID: <200210081811.g98IBBuY253618@electre.pasteur.fr> January Weiner 3 wrote: > > Hello, > > > almost, but not quite. This converts one file to many files containg one > > sequence. I need something like a conversion of one file containing 1000 > > seqs to 10 files each containing 100 seqs > > I wrote you a simple perl script which should do the job. 
Save it to a > file and make it executable (I think you are using a Unix-based system, > aren't you?) with chmod a+x split.pl. To be on the safe side, put it in a > new directory, and copy your sequence file to the same directory. Now run > > ./split.pl > > ...where filename is the name of the file containing your 1000+ sequences, > and is the number of sequences you wish to have in > each produced file. The produced file will have the same name as the > original file with the appendix .1, .2, .3 etc. > > I tried the script and it seems to work fine. Meet the power of Perl :-) For information, I have installed this script and the program is available at: http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html -- Catherine Letondal -- Pasteur Institute Computing Center From avc at sanger.ac.uk Tue Oct 8 14:38:50 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Tue, 8 Oct 2002 19:38:50 +0100 Subject: fasta splitter References: <3DA30AFF.3010100@uk.lionbioscience.com> Message-ID: <000d01c26ef9$f206f710$0a00a8c0@zeus> ----- Original Message ----- From: "Peter Rice" To: "Tony Cox" Cc: "January Weiner 3" ; ; Sent: Tuesday, October 08, 2002 5:42 PM Subject: Re: fasta splitter > Hi Tony > > > that sounds excellent - does this mean it really will make it in to the EMBOSS > > release? (any idea when? ;) > > I already have the first part of the code ... a modified "seqret" to split > into 10 sequences per file. > > Working copy is called "tenco" :-) > > What did you have in mind as a naming convention for the output files? The > existing code names each file after the first sequence, I guess you want > "outfile.1" "outfile.2" and so on, possibly with leading zeroes > "outfile.,001" etc. Hi Peter, This sounds great to me. Personally, I'd prefer not to have the leading zeros - just an incrementing ".[integer]" appended to the filename supplied. Makes shell manipulation easier. 
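
The naming scheme discussed in this thread (chunks of N sequences written to filename.1, filename.2, and so on, with no leading zeros) can be sketched in a few lines of Python. This is not January Weiner's split.pl, which was scrubbed from the archive, and it is not the modified seqret either; the function name split_fasta is hypothetical and the code assumes plain FASTA input where every record starts with a ">" line.

```python
def split_fasta(path, per_file):
    """Write the FASTA records in `path` to path.1, path.2, ... with at
    most `per_file` records each. Returns the list of files written."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    written = []
    out = None
    count = 0  # number of records seen so far
    with open(path) as src:
        for line in src:
            if line.startswith(">"):
                # A header line starts a new record; open the next chunk
                # file whenever the current one is full.
                if count % per_file == 0:
                    if out is not None:
                        out.close()
                    name = "%s.%d" % (path, count // per_file + 1)
                    out = open(name, "w")
                    written.append(name)
                count += 1
            if out is not None:  # ignore anything before the first header
                out.write(line)
    if out is not None:
        out.close()
    return written
```

For example, split_fasta("seqs.fa", 100) on a file of 1000 sequences would write seqs.fa.1 through seqs.fa.10. Only one output file is open at a time, at the cost of uneven file sizes when sequence lengths vary a lot.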
I guess the ideal would able to supply either a number of chunks to split the file in to or else specify a maximum size (either in bytes or fasta entries) for each chunk. cheers Tony From mathog at mendel.bio.caltech.edu Tue Oct 8 15:00:12 2002 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Tue, 08 Oct 2002 12:00:12 -0700 Subject: fasta splitter Message-ID: There's more than one way to split a fasta file... 1. Split M entries into N files, file 1 receives 1->M/N, file 2 receives M/N+1->2M/N, etc. Advantages - only one file needs to be open at a time, simple. Disadvantage - the resulting split is typically uneven. Do this with the NCBI databases and you'll find that they are heavily weighted for smaller sequences at the beginning and longer ones at the end. If the point of the split is to load balance (this is what I use it for, with parallel BLAST) some nodes will finish much earlier than others. Implementation: (deleted, I found this method not to be generally useful) 1b. head/tail/segment entries out of a fasta file. While (1) caused a lot of problems I've often needed to chop out a specific part of a fasta file. Why? Because some piece of software was blowing up on the 351,234 entry, but only if preceded by several thousand other entries. Finding the smallest piece that will trigger the bug can save hours of run time debugging these sorts of problems. Implementation: ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c 2. Split M entries into N files, cycling output to each file. That is, entry M goes to file M modulo N. Advantage - resulting files tend to be more even in size. Disadvantage - N output files must be open at once (or you have to cycle through N times, once per phase); if M is small and the size of each entry large the resulting files will not generally be balanced. Example, splitting the yeast genome, heaven help us when full length human chromosomes start showing up as single FASTA file entries. 
Implementation: ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

3. Split P bases in M entries into N files "evenly", fragmenting sequences if they are too large.
Advantage: fixes the genome data problem from (2).
Disadvantages: even more complex than (2) and "entries" in resulting files do not correspond one to one with the original. Even with clever naming conventions (yeastII_100001_200000) end users will be confused. Clever names will be truncated by most software at the worst possible place resulting in a "hit" on "yeastII_" :-(.
Implementation: (well, partially, this one translates in all 6 frames, but it has some of the naming/fragmenting features): ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c

4. Split by content, i.e. strip all the human sequences out of nr. I don't believe there is a general solution because there is no universally agreed upon FASTA header line format.
Implementation: SRS or something similar.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From squiresb at macrogenics.com Tue Oct 8 15:20:35 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Tue, 08 Oct 2002 14:20:35 -0500
Subject: eprimer3...broken pipe?
In-Reply-To:
Message-ID:

I have tried to install various versions of emboss and when I try to run eprimer3 I get the following message:

[loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
Picks PCR primers and hybridization oligos
Input sequence(s): /bioinfo/fragments.fa
Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out

EMBOSS An error in eprimer3.c at line 317:
The program 'primer3_core' must be on the path.
It is part of the 'primer3' package,
available from the Whitehead Institute.
See: http://www-genome.wi.mit.edu/
Broken pipe

Does anybody know how to fix this?

Thanks!

Burke Squires

--
Burke Squires
Bioinformatics
MacroGenics, Inc.
2600 Stemmons Freeway, Suite 210
Dallas, TX 75235 USA
Work: 214-634-3000 X224
Squiresb @ macrogenics.com (Please remove spaces to use)
www.macrogenics.com
----------------------------------------------------------------------------
This e-mail and any attachments may be confidential or legally privileged. If you received this message in error or are not the intended recipient, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained herein. Please inform us of the erroneous delivery by return e-mail.

Thank you for your cooperation.

From tchiang at bioinfo.sickkids.on.ca Tue Oct 8 15:31:18 2002
From: tchiang at bioinfo.sickkids.on.ca (Ted Chiang)
Date: Tue, 8 Oct 2002 15:31:18 -0400 (EDT)
Subject: cusp
Message-ID:

Hi,

I have a question about the EMBOSS program cusp. The program creates a codon usage table based on the "coding" sequence of the input file. My question is: how does it determine where the coding sequence (or ORF) lies in a given DNA sequence when one runs the program without specifying the -sbeg and -send flags? i.e.

$cusp dna_seq

How does cusp determine where the coding sequence begins? As opposed to

$cusp dna_seq -sbegin 135 -send 192

where the coding sequence is specified. In the latter case, if the length of the specified region is not divisible by 3, does cusp ignore the last few nucleotides?

Thanks.

-Ted

=====================================
Ted Chiang, Analyst
Centre for Computational Biology
Hospital for Sick Children, Toronto
416.813.7028
tchiang at bioinfo.sickkids.on.ca
=====================================

From sebastian.bassi at ar.advantaseeds.com Tue Oct 8 16:05:43 2002
From: sebastian.bassi at ar.advantaseeds.com (Sebastian Bassi)
Date: Tue, 8 Oct 2002 22:05:43 +0200
Subject: fasta splitter
Message-ID:

> What did you have in mind as a naming convention for the
> output files?
The
> existing code names each file after the first sequence, I
> guess you want
> "outfile.1" "outfile.2" and so on, possibly with leading zeroes
> "outfile.001" etc.

My $.02: I think that outfile.[number_here] is not a good convention, since the extension (whatever you put after the dot) means the file type, and here the file type is always the same (ASCII text). I think it should be something like:
outfile_[number].txt
It should look like this:
outfile_1.txt
outfile_2.txt
outfile_3.txt

Anyway, IANAP (I am not a programmer). I'm just an end user and I'm stating this from a user consistency viewpoint. If I have two mp3 files (xsongpart1 and xsongpart2) I would name them part1.mp3 and part2.mp3 and NOT xsong.1 and xsong.2

From mathog at mendel.bio.caltech.edu Tue Oct 8 16:33:56 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 08 Oct 2002 13:33:56 -0700
Subject: fasta splitter
Message-ID:

> > What did you have in mind as a naming convention for the
> > output files? The
> > existing code names each file after the first sequence, I
> > guess you want
> > "outfile.1" "outfile.2" and so on, possibly with leading zeroes
> > "outfile.001" etc.
>
> My $.02: I think that outfile.[number_here] is not a good convention, since the extension (whatever you put after the dot) means the file type, and here the file type is always the same (ASCII text). I think it should be something like:
> outfile_[number].txt
> It should look like this:
> outfile_1.txt

I agree. Also the numeric range should be displayed in a fixed column width. Ideally something like:

% esplit \
  -sequence=ncbi_nr.nfa \
  -fmask='nr_frag_####.nfa' \
  -splitn=20 \
  -splitmode=cycle \
  -numberfrom=0

would produce

nr_frag_0000.nfa
...
nr_frag_0019.nfa

Keeping the names fixed width prevents all sorts of text alignment problems which can show up otherwise.
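
The esplit command above is only a wish, not an existing program, but the behaviour it describes (cycle mode, fixed-width numbering from 0) can be roughly approximated with a few lines of awk. What follows is a sketch under assumptions: the function name split_cycle and its argument order are made up for illustration, all N output files stay open at once, and the input is plain FASTA with ">" header lines.

```shell
#!/bin/sh
# split_cycle FASTA N PREFIX   (hypothetical helper, not part of EMBOSS)
# Entry k (counting from 0) is written to PREFIX + zero-padded (k mod N) + ".nfa".
split_cycle() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # A new entry starts: pick the next output file in the cycle.
            out = sprintf("%s%04d.nfa", prefix, count % n)
            count++
        }
        # Lines before the first ">" header are ignored.
        out != "" { print > out }
    ' "$1"
}

# Example (not run here):
#   split_cycle ncbi_nr.nfa 20 nr_frag_
# writes up to 20 files named nr_frag_0000.nfa ... nr_frag_0019.nfa
```

Because the numbers are zero padded to four digits, a plain "ls" lists the pieces in numeric order, which is the point being made above about fixed-width names.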

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From fernan at iib.unsam.edu.ar Tue Oct 8 18:51:16 2002
From: fernan at iib.unsam.edu.ar (Fernan Aguero)
Date: Tue, 8 Oct 2002 19:51:16 -0300
Subject: fasta splitter
In-Reply-To: <3DA309D0.7@uk.lionbioscience.com>
References: <3DA309D0.7@uk.lionbioscience.com>
Message-ID: <20021008225116.GA273@iib.unsam.edu.ar>

+----[ Thus spoke Peter Rice (peter.rice at uk.lionbioscience.com):
| [ snipped ]
|
| One problem doing this in EMBOSS is the need to generate filenames for your
| split files - but maybe a base filename would be enough to generate
| names.

Now let me get myself into the discussion. The splitter I use is called 'shatter' and is part of the SEALS package, which I guess is unmaintained (and perhaps obsolete?) and is basically perl.

ftp://ftp.ncbi.nih.gov/pub/walker/seals/software

The following discussion works for splitting into individual sequences, but not into groups of sequences. In that case a different naming scheme should be used (though perhaps the same argument specifier '-word' could be used?).

The approach of shatter (both for splitting FASTA files, but also for splitting concatenated BLAST reports, which are split by 'shatterblast') is to let you choose the 'word' which will be used as a basename. Both shatters know about the NCBI FASTA standard and thus, given a FASTA header like the following:

>gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc

will take the gi as word 2 (123456), the accession number (AA123456) as word 4, the accession.version (AA123456.1) as word 5 and so on.

On the command line you just say 'shatter -word 1 fastafile' if you want the first word after the '>' to be the basename. This produces files with that basename, ending in .fa

The program will consider whitespace and the character '|' as word delimiters. In my own experience this is a good thing.
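
To make the word numbering concrete, here is a small awk sketch of the rule just described (whitespace and '|' both act as delimiters, counting starts after the '>'). It is not shatter's actual code; header_word is a made-up name used only for illustration.

```shell
#!/bin/sh
# header_word HEADER N   (hypothetical helper, not part of SEALS)
# Print the Nth word of a FASTA header line, where words are separated
# by runs of "|" and whitespace and the leading ">" is not counted.
header_word() {
    printf '%s\n' "$1" | awk -v w="$2" '
        {
            sub(/^>/, "")                        # drop the FASTA marker
            n = split($0, words, /[| \t]+/)      # "|" and blanks delimit words
            if (w >= 1 && w <= n) print words[w]
        }
    '
}

# Example (not run here):
#   header_word ">gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X" 4
```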
I've used shatter with many different FASTA flavours and adjusting the word to be used as basename is plain easy. BLAST reports are also trivial since query sequences are also usually in FASTA format, and you get basically the same header, though after the 'Query=' magic word. In this case you get files with the same basename, but ending in .br

Just my 2 cents. Hope this makes it into EMBOSS.

Fernan

| and change the output file. You can add a command line option for the
| number of sequences in an output file. Cleaning up output files for a rerun
| is an exercise for the user (unless you want to invent a new ACD type that
| does it :-)
|
| Needs a modified version of the seqFileReopen function to handle the file
| naming, but nothing complicated is involved.
|
| regards.
|
| Peter
|
| --
| ------------------------------------------------
| Peter Rice, LION Bioscience Ltd, Cambridge, UK
| peter.rice at uk.lionbioscience.com +44 1223 224723
|
|
+----]

--
F e r n a n   A g u e r o
http://genoma.unsam.edu.ar/~fernan

From gwilliam at hgmp.mrc.ac.uk Wed Oct 9 04:28:08 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Wed, 09 Oct 2002 09:28:08 +0100
Subject: eprimer3...broken pipe?
References:
Message-ID: <3DA3E898.C9318CEE@hgmp.mrc.ac.uk>

From the eprimer3 documentation:

The Whitehead program must be set up and on the path in order for eprimer3 to find and run it.
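
A quick way to check this from a Bourne-type shell is sketched below. The directory in PRIMER3_DIR is purely an assumption for illustration (use wherever primer3_core actually ended up after the build), and on_path is a made-up helper name.

```shell
#!/bin/sh
# on_path NAME: succeed if NAME resolves to a command in the current shell.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Assumed location of the primer3_core binary; adjust to your own build.
PRIMER3_DIR="${PRIMER3_DIR:-$HOME/primer3/src}"

# Prepend the assumed directory only if primer3_core is not found already.
if ! on_path primer3_core; then
    PATH="$PRIMER3_DIR:$PATH"
    export PATH
fi

if on_path primer3_core; then
    echo "primer3_core found: $(command -v primer3_core)"
else
    echo "primer3_core is still not on the path; eprimer3 will fail" >&2
fi
```

csh/tcsh users (as in the session quoted in this thread) would set the path with "set path = ( ... $path )" instead, typically in ~/.cshrc.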
The Whitehead Institute program that is run by this program is available from:
http://www-genome.wi.mit.edu/genome_software/other/primer3.html
(Then see the link 'Get release 0.9')

The version that is run by this program is 3.0.9, currently available from:
http://www-genome.wi.mit.edu/ftp/distribution/software/primer3_0_9_test.tar.gz

Gary

Burke Squires wrote:
>
> I have tried to install various version of emboss and when I try and run
> eprimer3 I get the following message:
>
> [loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
> Picks PCR primers and hybridization oligos
> Input sequence(s): /bioinfo/fragments.fa
> Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out
>
> EMBOSS An error in eprimer3.c at line 317:
> The program 'primer3_core' must be on the path.
> It is part of the 'primer3' package,
> available from the Whitehead Institute.
> See: http://www-genome.wi.mit.edu/
> Broken pipe
>
> Does anybody know how to fix this?
>
> Thanks!
>
> Burke Squires
>
> --
> Burke Squires
> Bioinformatics
> MacroGenics, Inc.
> 2600 Stemmons Freeway, Suite 210
> Dallas, TX 75235 USA
> Work: 214-634-3000 X224
> Squiresb @ macrogenics.com (Please remove spaces to use)
> www.macrogenics.com
> ----------------------------------------------------------------------------
> This e-mail and any attachments may be confidential or legally privileged.
> If you received this message in error or are not the intended recipient, you
> should destroy the e-mail message and any attachments or copies, and you are
> prohibited from retaining, distributing, disclosing or using any information
> contained herein. Please inform us of the erroneous delivery by return
> e-mail.
>
> Thank you for your cooperation.

--
Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK

From Joerg.Schaber at uv.es Wed Oct 9 13:02:51 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Wed, 09 Oct 2002 19:02:51 +0200
Subject: swissprot
Message-ID: <3DA4613B.3010901@uv.es>

Hi,

I can't load the SWISSPROT bacteria database (ftp://ftp.ebi.ac.uk/pub/databases/swissprot/special_selections/bacteria.seq) into EMBOSS. I think EMBOSS is running well because I have no problem accessing the test databases (see showdb below). However, I think seqret is somehow using the wrong division file, although the PATH settings seem to be correct.

greetings,

joerg

> dbiflat
Index a flat file database
EMBL : EMBL
SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
GB : Genbank, DDBJ
REFSEQ : Refseq
Entry format [SWISS]: SWISS
Database directory [.]:
Wildcard database filename [*.dat]: *.seq
Database name: swissbac
Release number [0.0]: 1.0
Index date [00/00/00]: 09/10/02

> ll
insgesamt 132100
950883 drwxrwxr-x 2 root users 4096 Okt 9 18:50 .
623533 drwxrwxr-x 5 jos jos 4096 Okt 9 17:50 ..
950889 -rw-r--r-- 1 jos jos 189028 Okt 9 18:50 acnum.hit
950888 -rw-r--r-- 1 jos jos 660456 Okt 9 18:50 acnum.trg
623548 -rw-r--r-- 1 jos jos 133412511 Okt 9 18:25 bacteria.seq
950886 -rw-r--r-- 1 jos jos 322 Okt 9 18:50 division.lkp
950887 -rw-r--r-- 1 jos jos 836840 Okt 9 18:50 entrynam.idx

> showdb
Displays information on the currently available databases
# Name     Type ID  Qry All Comment
# ====     ==== ==  === === =======
swissbac   P    OK  OK  OK  SWISSPROT sequences of procaryotes 9/10/02
tpir       P    OK  OK  OK  PIR using NBRF access for 4 files
tsw        P    OK  OK  OK  Swissprot native format with EMBL CD-ROM index
tswnew     P    OK  OK  OK  Swissnew as 3 files in native format with EMBL CD-ROM index
twp        P    OK  OK  OK  EMBL new in native format with EMBL CD-ROM index
buch       N    OK  OK  OK  Buchnera database in DDBJ Format
fbuch      N    OK  OK  OK  Buchnera database in FASTA Format
tembl      N    OK  OK  OK  EMBL in native format with EMBL CD-ROM index
tgb        N    OK  -   -   Genbank IDs
tgenbank   N    OK  OK  OK  GenBank in native format with EMBL CD-ROM index

> head bacteria.seq
ID 120K_RICRI STANDARD; PRT; 1300 AA.
AC P14914;
--snipp
--snipp

> seqret swissbac:120K_RICRI
Reads and writes (returns) sequences
Warning: Cannot open division file '' for database 'swissbac'
Warning: seqCdQry failed
Error: Unable to read sequence 'swissbac:120K_RICRI'
>

--
----------------------------------------------------------
Joerg Schaber
Instituto Cavanilles de Biodiversidad y Genetica Evolutiva
Universidad de Valencia
Tel.: ++34 96 398 3647   A.C. 22085
Fax.: ++34 96 398 3670   46071 Valencia, España
email : jos at uv.es

From jweiner1 at ix.urz.uni-heidelberg.de Thu Oct 10 05:17:51 2002
From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3)
Date: Thu, 10 Oct 2002 11:17:51 +0200 (METDST)
Subject: fasta splitter
In-Reply-To: <000d01c26ef9$f206f710$0a00a8c0@zeus>
Message-ID:

> This sounds great to me. Personally, I'd prefer not to have the leading
> zeros - just an incrementing ".[integer]" appended to the filename supplied.
> Makes shell manipulation easier.

Well, I'd prefer the former -- because it makes shell manipulation easier :-) If you stay with the leading 0's, then any listing will show the files in the correct order, otherwise it will show "foo.1, foo.10, ..., foo.100,... foo.2, ..." etc.

j.

----)-\//-///-----------------------------------January-Weiner-3-------
"'Tis true, there's magic in the web of it." -- Shakespeare

From kenneth at geisshirt.dk Mon Oct 14 07:02:15 2002
From: kenneth at geisshirt.dk (Kenneth Geisshirt)
Date: Mon, 14 Oct 2002 13:02:15 +0200 (CEST)
Subject: Splitting genbank
Message-ID:

Hi everyone

I recently joined the mailing list (after a couple of weeks' usage of EMBOSS) so I hope that my question isn't a FAQ.

I have a local copy of genbank, and I wish to split it into four databases: one for humans, one for rats, one for mice and one for the rest. The applications seqret and seqretsplit can help me with the first three by specifying the organism in the usa, but how do I specify "not human and not rat and not mouse"?

Thanks in advance

Kneth

--
Kenneth Geisshirt, M.Sc., Ph.D.   http://kenneth.geisshirt.dk
Grøndals Parkvej 2A, 3. sal       kenneth at geisshirt.dk
DK-2720 Vanløse                   +45 38 87 78 38

From peter.rice at uk.lionbioscience.com Mon Oct 14 07:27:34 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 12:27:34 +0100
Subject: Splitting genbank
References:
Message-ID: <3DAAAA26.1080707@uk.lionbioscience.com>

Kenneth Geisshirt wrote:
> I have a local copy of genbank, and I wish to split it into four
> databases: one for humans, one for rats, one for mice and one for the
> rest. The applications seqret and seqretsplit can help me with the first
> three by specifying the organism in the usa, but how do I specify "not
> human and not rat and not mouse"?

In EMBOSS ....

split the gbrod file into rat, mouse and other rodents (a simple perl script would do)

index and define GenBank

then define subsets using the same index files and exclude the ones you don't want using, for example:

exclude: "*pri* *rat* *mus*"

... in copies of your EMBOSS database definition for genbank.

EMBOSS simply checks the excluded files list when using the index files.

regards,

Peter Rice

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From Joerg.Schaber at uv.es Mon Oct 14 08:11:59 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Mon, 14 Oct 2002 14:11:59 +0200
Subject: other indices
Message-ID: <3DAAB48F.6080704@uv.es>

Hi,

dbiflat can index fields other than id and accession number, such as sequence version (seqv), description (des), keywords and taxon. However, in the example databases that come with EMBOSS I found only field definitions like 'fields: "sv des org key"'. So do I access the additional indices (e.g. in seqret) via 'seqret-sv:\*', 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively? 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*' did not work.

Greetings,

joerg

From gwilliam at hgmp.mrc.ac.uk Mon Oct 14 08:18:20 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Mon, 14 Oct 2002 13:18:20 +0100
Subject: other indices
References: <3DAAB48F.6080704@uv.es>
Message-ID: <3DAAB60C.ACF4111E@hgmp.mrc.ac.uk>

See:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/UniformSequenceAddress.html#keys

You append the 'sv', 'des', 'org', 'key', etc to the database name with a '-' and to a file name with a ':', so:

with a database you use a command like:

seqret embl-des:fau

with a file you use a command like:

seqret filename:org:homo

Gary

Joerg Schaber wrote:
>
> Hi,
>
> dbiflat allows to index other fields except id and accession number like
> sequence version (seqv), description (des), keywords and taxon. However,
> in the example databases that come with EMBOSS I found only field
> definitions like 'fields: "sv des org key"'. So do I access the
> additional indices (e.g. in seqret) via 'seqret-sv:\*',
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively?
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*' did not work.
>
> Greetings,
>
> joerg

--
Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK

From peter.rice at uk.lionbioscience.com Mon Oct 14 08:20:40 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 13:20:40 +0100
Subject: other indices
References: <3DAAB48F.6080704@uv.es>
Message-ID: <3DAAB698.1080108@uk.lionbioscience.com>

Joerg Schaber wrote:
> dbiflat allows to index other fields except id and accession number like
> sequence version (seqv), description (des), keywords and taxon. However,
> in the example databases that come with EMBOSS I found only field
> definitions like 'fields: "sv des org key"'. So do I access the
> additional indices (e.g.
in seqret) via 'seqret-sv:\*',
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively?
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*' did not work.

For a database called schaber

dbiflat -fields "acnum,seqvn,des,keyword,taxon"

In the emboss.default definition:

DB schaber [
  type: P
  format: swiss
  method: emblcd
  dir: /data/schaber
  indexdir: /data/schaber
  comment: "Flatfiles database, all fields indexed"
  fields: "sv des org key"
]

In EMBOSS programs, use the USA:

'schaber-sv:\*' 'schaber-des:\*' 'schaber-org:\*' 'schaber-key:\*'

The confusion comes because the database definition (and the USA syntax) uses the field names in common use (e.g. in SRS) and dbiflat uses the EMBLCD/Staden index file names that dbiflat will be writing.

regards,

Peter

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From jmuehlis at uni-muenster.de Tue Oct 15 05:03:19 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:03:19 +0200
Subject: this format is not readable by seqret
Message-ID: <3DABD9D7.4101AA0D@uni-muenster.de>

Hello there,

my name is Jörg Mühlisch and I work in the Department of Pediatric Hematology and Oncology at the University of Munster (Germany). As a scientist I use EMBOSS on Linux.

So here is my first question: I have a set of sequences in different formats. Before trying to index them I tested them for readability with seqret:

find ./ -name "*" -exec seqret -osf fasta {} ../Sequencesothers/{} \;

Some of my files are not readable and I do not know the name of their format:

Contig 1 (1,506)
Contig Length: 506 bases
Average Length/Sequence: 458 bases
Total Sequence Length: 1375 bases
Top Strand: 3 sequences
Bottom Strand: 0 sequences
Total: 3 sequences
^^
AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...

Maybe there is a way to convert this format into an appropriate one.
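
One possible conversion, sketched with awk, is shown here for the simplest case only. It rests on an assumption that the sample above suggests but does not prove: that everything after the line starting with "^^" is a single sequence. The function name editseq_to_fasta and the way the entry name is passed in are made up for illustration.

```shell
#!/bin/sh
# editseq_to_fasta FILE NAME   (hypothetical helper)
# Assumes FILE holds a text header, a line starting with "^^", and then
# one sequence. Writes a FASTA entry called NAME to standard output.
editseq_to_fasta() {
    awk -v name="$2" '
        BEGIN { print ">" name }
        in_seq {
            gsub(/[ \t\r]/, "")      # drop blanks and DOS line ends
            if ($0 != "") print
        }
        # The delimiter line itself is skipped; sequence starts after it.
        index($0, "^^") == 1 { in_seq = 1 }
    ' "$1"
}

# Example (not run here):
#   editseq_to_fasta contig1.seq contig1 > contig1.fasta
#   seqret contig1.fasta -osf embl -outseq contig1.embl
```

If a file turns out to hold several contigs or several reads, this sketch would silently merge them into one entry, so check a few files by eye first.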

Thanks

Jörg Mühlisch

-------------- next part --------------
A non-text attachment was scrubbed...
Name: jmuehlis.vcf
Type: text/x-vcard
Size: 339 bytes
Desc: Karte für Joerg Muehlisch
Url : http://lists.open-bio.org/pipermail/emboss/attachments/20021015/6754baf0/attachment.vcf

From peter.rice at uk.lionbioscience.com Tue Oct 15 05:16:32 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 15 Oct 2002 10:16:32 +0100
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
Message-ID: <3DABDCF0.6090809@uk.lionbioscience.com>

Joerg Muehlisch wrote:
> Some of my files are not readable and I do not know the name of their
> format:
>
> Contig 1 (1,506)
> Contig Length: 506 bases
> Average Length/Sequence: 458 bases
> Total Sequence Length: 1375 bases
> Top Strand: 3 sequences
> Bottom Strand: 0 sequences
> Total: 3 sequences
> ^^
> AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...
>
> May be there is a way to change this format in an apropriate way.

Should be possible, if the format is common enough.

Where does the file come from? Does this program/package have an option to save in one of the (many) 'standard' formats?

regards,

Peter Rice

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From jmuehlis at uni-muenster.de Tue Oct 15 05:44:28 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:44:28 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com>
Message-ID: <3DABE37C.B045FF6E@uni-muenster.de>

Hi,

in fact I hoped that somebody on the list would know where this format comes from. In my set of files I found just a few of these unreadable sequences. As it does not seem to be a well-known format, I will try to find out where it is used.

Thanks

Jorg

Peter Rice wrote:

> Should be possible, if the format is common enough.
>
> Where does the file come from? Does this program/package have an option to
> save in one of the (many) 'standard' formats?
>
> regards,
>
> Peter Rice
>
> --
> ------------------------------------------------
> Peter Rice, LION Bioscience Ltd, Cambridge, UK
> peter.rice at uk.lionbioscience.com +44 1223 224723

-------------- next part --------------
A non-text attachment was scrubbed...
Name: jmuehlis.vcf
Type: text/x-vcard
Size: 339 bytes
Desc: Karte für Joerg Muehlisch
Url : http://lists.open-bio.org/pipermail/emboss/attachments/20021015/ff4ad8b4/attachment.vcf

From kdj at sanger.ac.uk Tue Oct 15 06:38:46 2002
From: kdj at sanger.ac.uk (Keith James)
Date: 15 Oct 2002 11:38:46 +0100
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de>
Message-ID:

>>>>> "Joerg" == Joerg Muehlisch writes:

Joerg> Hi, in fact I hoped that anybody in the List would know
Joerg> where this format comes from. In my file sample I just
Joerg> found some of thes unreadable sequences. As it does not
Joerg> seem to be a good known format, I will try to find out
Joerg> where it is used.

I _think_ this may be flatfile output from DNAStar/Lasergene. It's been a while since I've seen any files like that but the ^^ delimiter reminded me of it.

I don't have access to the package to verify this.

Keith

--

- Keith James bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -

From jrvalverde at cnb.uam.es Tue Oct 15 08:09:12 2002
From: jrvalverde at cnb.uam.es (José R.
Valverde)
Date: Tue, 15 Oct 2002 14:09:12 +0200
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <20021015140912.7294cd80.jrvalverde@cnb.uam.es>

On Tue, 15 Oct 2002 11:44:28 +0200 Joerg Muehlisch wrote:
> Hi,
>
> in fact I hoped that anybody in the List would know where this format
> comes from. In my file sample I just found some of thes unreadable
> sequences.
> As it does not seem to be a good known format, I will try to find out
> where it is used.
>

Maybe it would help if you were able to post a full sample file.

From the fragments you posted it looked like a sequencing project file. It mentioned a contig size, with many gel readings of average length and the orientation coverage of gels (+/- strands).

Iff the sequence contained (you only included a few bases) is just the consensus, i.e. a single sequence of length exactly equal to the consensus length, then conversion to any format should be trivial. Simply do a 'tail +8 {}'

Otherwise it might contain the gel readings (and the consensus?), and then it would be a multiple sequence file, possibly with gel overlaps et al. and conversion may be a bit more difficult.

It may also be that more than one contig and its associated files are included in one file, making processing more difficult.

Initially I would expect the second choice to be true, from the header: several short sequences making up a contig plus the consensus. In your example, the first contig would be 506 bases, composed of three gels of average length 458. Since 1375/3 = 458, I deduce that the consensus sequence is not included. Therefore you have a multiple sequence file of overlapping gel readings.

You may try this:
1) find out if more than one contig is in the file
2) find out how sequences are separated
3) decide what you want to do with them, e.g.
split the file at "^Contig " lines
strip comment lines (^*:*$)
split at sequence separators

see csplit(1) for details on how to do it on a pipeline. E.g. assuming sequences are delimited by a blank line, this _might_ work:

csplit file /^Contig / -f contig
foreach i ( contig.* )
tail +8 $i | csplit - /\
\
/ -f ${i}.gel
end

(note that we need to escape newlines directly) and you'd get the raw sequences all right as contig.##.gel.##

j

From jmuehlis at uni-muenster.de Tue Oct 15 09:50:03 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 15:50:03 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <3DAC1D0B.3ACD64FD@uni-muenster.de>

Keith James wrote:

Yes, that might well be; I think our collaborating group is working with DNAStar. Nevertheless, there does not seem to be an EMBOSS way to change the file format, so I will try it with Linux tools like tr.

Thanks for your help.

Jorg

> I _think_ this may be flatfile output from DNAStar/Lasergene. It's
> been a while since I've seen any files like that but the ^^ delimiter
> reminded me of it.
>
> I don't have acces to the package to verify this.
>
> Keith
>
> --
>
> - Keith James bioinformatics programming support -
> - Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -

-------------- next part --------------
A non-text attachment was scrubbed...
Name: jmuehlis.vcf
Type: text/x-vcard
Size: 339 bytes
Desc: Karte für Joerg Muehlisch
Url : http://lists.open-bio.org/pipermail/emboss/attachments/20021015/6b7822b9/attachment.vcf

From gbottu at ben.vub.ac.be Mon Oct 21 04:34:38 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 10:34:38 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210834.KAA1459646@black.vub.ac.be>

from : BEN

Dear colleagues,

While doing some experimenting with fuzzpro, I tried the following :

-----------------
Input sequence(s): sw:pap?_carpa
Search pattern:

from : BEN

Dear colleagues,

I was looking at what the program prophecy is doing and I am puzzled. What is the difference between Gribskov and Henikoff profiles ? Both seem to have match/mismatch scores computed with the help of a scoring matrix as well as gap penalties. Furthermore, I thought that the Henikoffs made the Blocks databank using profiles without gaps. Can someone help me ?

Guy Bottu

From ableasby at hgmp.mrc.ac.uk Mon Oct 21 05:11:57 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 21 Oct 2002 10:11:57 +0100 (BST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210911.KAA28060@bromine.hgmp.mrc.ac.uk>

Terminating full-stops are currently not part of the EMBOSS implementation of PROSITE patterns. Strictly speaking they are part of the PROSITE syntax, although unnecessary, so we can accept them in future releases. For now, if you just omit the '.' the pattern will work.

Alan

From gbottu at ben.vub.ac.be Mon Oct 21 05:36:05 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 11:36:05 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210936.LAA1462808@black.vub.ac.be>

Without the '.' it does not give an error. I get :

------------------
> fuzzpro
Protein pattern search
Input sequence(s): sw:pap?_carpa
Search pattern:

We'll look into that.
Looks to be a boundary condition affecting zero length N terminal ranges.

Thanks

Alan

From simon.andrews at bbsrc.ac.uk Mon Oct 21 10:00:35 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 15:00:35 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28753@bi-exsrv1.iapc.bbsrc.ac.uk>

I'm having all sorts of problems working with the latest release of RefSeq, due to a change in the way the files are being laid out. In older releases of RefSeq the LOCUS identifier was the same as the accession number (eg NM_0123456), but in the latest version the LOCUS identifier is the gene identifier, and these aren't unique in the database!!

This means that when I run dbiflat (even using -idformat REFSEQ) I get a load of warnings about duplicate entries and when I later try to use the database I find that a load of entries are inaccessible because of this.

For example accessions NM_134265, NM_134264 and NM_015626 all have the ID WSB1.

How can I get dbiflat to index with the accession number as its primary identifier so I don't lose entries when indexing them??

Thanks

Simon

PS This actually looks like a mistake by the RefSeq curators - I mean who thought that having a non-unique primary sequence identifier was a good idea!!!
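
To see how many entries are affected before indexing, the LOCUS names that occur more than once can be listed with standard tools. This is only a sketch: it assumes GenBank-style flat files in which the entry name is the second whitespace-separated field of each LOCUS line, and duplicate_locus_ids is a made-up helper name.

```shell
#!/bin/sh
# duplicate_locus_ids FILE...   (hypothetical helper)
# Print each LOCUS name that appears on more than one LOCUS line.
duplicate_locus_ids() {
    awk '/^LOCUS/ { print $2 }' "$@" | sort | uniq -d
}

# Example (not run here):
#   duplicate_locus_ids *.gbff | wc -l
```

Every name this prints corresponds to at least one entry that an ID-based index cannot reach.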

--
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463

From simon.andrews at bbsrc.ac.uk Mon Oct 21 11:24:39 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 16:24:39 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28754@bi-exsrv1.iapc.bbsrc.ac.uk>

> -----Original Message-----
> From: simon andrews (BI) [mailto:simon.andrews at bbsrc.ac.uk]
> Subject: Indexing Refseq
>
>
> I'm having all sorts of problems working with the latest
> release of RefSeq
>
> This means that when I run dbiflat (even using -idformat
> REFSEQ) I get a load of warnings about duplicate entries and
> when I later try to use the database I find that a load of
> entries are inaccessible because of this.
>
> For example accessions NM_134265,NM_134264 and NM_015626 all
> have the ID WSB1.

Just to follow up to myself - I've found a temporary workaround for this problem.

The Bioperl script at the bottom of the message will pre-process the current Refseq files into a format which dbiflat can then index without errors. You will see a warning from the NC_xxxx chromosome files in Refseq, but as these are only features with no sequence I wasn't too worried about them and just skipped them.

Usage of the script is "script_name [infile] > outfile".

TTFN

Simon.

-------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# This script is a filter through which we can
# pass the whole of refseq. Newer versions of
# refseq replaced their locus ID with a string
# which wasn't the accession number. This
# just changes them back.
my ($filename) = @ARGV; die "No filename given" unless ($filename); my $in = Bio::SeqIO -> new(-file => $filename, -format => 'genbank'); die "Couldn't read $filename" unless ($in); my $out = Bio::SeqIO -> new(-fh => \*STDOUT, -format => 'genbank'); die "Couldn't make output pipe" unless ($out); while (my $seq = $in -> next_seq()){ # Some NC_xxx seqs are in the Refseq file # but don't have any sequence attached. We'll # skip those files... next if ($seq -> accession =~ /^NC/); $seq -> display_id($seq-> accession()); $out -> write_seq($seq); } #------------------------------------------------------- From jmuehlis at uni-muenster.de Tue Oct 22 04:06:55 2002 From: jmuehlis at uni-muenster.de (Joerg Muehlisch) Date: Tue, 22 Oct 2002 10:06:55 +0200 Subject: this format is not readable by seqret References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de> <3DAC1D0B.3ACD64FD@uni-muenster.de> Message-ID: <3DB5071F.EEB21A1@uni-muenster.de> Hi, Just for your information. This is the answer from my collaborators: The sequence is a DNAStar EditSeq file. The notation indicates that this sequence is consensus sequence from multiple reads put into a contig. If you do not have DNAStar, try to open with a wordprocessor program and cut and paste the sequence into whatever sequence editor you use. The sequence uses standard nomenclature (ie. W = A or T; M = A or C; etc.....) Thanks for your help. As this format is not readable I will now just change the format by other means. Jorg -------------- next part -------------- A non-text attachment was scrubbed... 
Name: jmuehlis.vcf Type: text/x-vcard Size: 339 bytes Desc: Karte f?r Joerg Muehlisch Url : http://lists.open-bio.org/pipermail/emboss/attachments/20021022/d8b859d3/attachment.vcf From Andres.Aeschlimann at id.unibe.ch Tue Oct 22 11:23:31 2002 From: Andres.Aeschlimann at id.unibe.ch (Andres Aeschlimann) Date: Tue, 22 Oct 2002 17:23:31 +0200 (MET DST) Subject: Cannot connect! Message-ID: Hi all Having installed jemboss for the first time. There's still a problem left: After launching emboss from http://ubecx04.unibe.ch:8080/jemboss/Jemboss.jnlp ( a trial campus emboss server ) the webstart window appears as it should, and the login window as well, where username and password can be entered. Later on the window says Cannot connect! and a window "Check Public Server Settings" with the contents of the jemboss.properties file: user.auth=true jemboss.server=true server.public=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter server.private=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter service.public=JembossAuthServer service.private=JembossAuthServer plplot=/products/emboss/emboss/share/EMBOSS/ embossData=/products/emboss/emboss/share/EMBOSS/data/ embossBin=/products/emboss/emboss/bin/ embossPath=/usr/bin/:/bin:/packages/clustal/:/packages/primer3/bin: acdDirToParse=/products/emboss/emboss/share/EMBOSS/acd/ embossURL=http://www.uk.embnet.org/Software/EMBOSS/Apps/ appears. soap-2_3_1 and jakarta-tomcat-4.1.12 are installed as described in order to use with ftp://ftp.hgmp.mrc.ac.uk/pub/EMBOSS/patchfiles/install-jemboss-server.sh rpcrouter listens on https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter : SOAP RPC Router Sorry, I don't speak via HTTP GET- you have to use HTTP POST to talk to me. ubecx04:/products/emboss.222 % java -version java version "1.4.0_00" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_00-b05) Java HotSpot(TM) Client VM (build 1.4.0_00-b05, mixed mode) on Solaris 9. Is there any log file where the cause would be explained? 
Thanks in advance for any hint. Res ========================================================= Dr. Andres Aeschlimann Andres.Aeschlimann at id.unibe.ch University of Berne Gesellschaftsstrasse 6 CH-3012 BERNE tel: +41 31 631 3845 Switzerland fax: +41 31 631 3865 From gbottu at ben.vub.ac.be Thu Oct 24 10:20:43 2002 From: gbottu at ben.vub.ac.be (Guy Bottu) Date: Thu, 24 Oct 2002 16:20:43 +0200 (CEST) Subject: questions about codon usage tables Message-ID: <200210241420.QAA1196695@black.vub.ac.be> from : BEN Dear colleagues, I just took a look at codon usage tables under EMBOSS. - there is a list of tables in .../share/EMBOSS/data/CODONS. Unfortunately, they have rather cryptic names. Is there a way to find out for which organism they are ? And from which data source do they come ? - there is a program cutgextract. I tried it : > cutgextract Extract data from CUTG CUTG directory [.]: /db/cutg (here is the file cutg.dat) But it does ... nothing. Anyone a clue ? Sincerely, Guy Bottu From areagp61 at yahoo.it Fri Oct 25 05:03:36 2002 From: areagp61 at yahoo.it (Graziano P.) Date: Fri, 25 Oct 2002 11:03:36 +0200 Subject: -filter option for water and stretcher Message-ID: <001e01c27c05$690c7520$18105709@italy.ibm.com> Hi All, I need to introduce sequences by standard input. I have found the -filter qualifier in the -help -verbose options. For example, if I use this qualifier for "transeq" I write: transeq -filter then I have to insert my sequence (in fasta format for example) pasting or writing it. When I have finished writing or pasting the sequences, I have to press CTRL-D to terminate the standard input introduction. Finally the program return the standard output. I have tried to use the -filter qualifier with "water" and "stretcher". These two programs require two sequences in input in different files. If I write as standard input: >HTRE_ECOLI P33129 OUTER MEMBRANE USHER PROTEIN ... 
PGVYDVSVYVNDQPIINQSITFVAIEGKKNAQACITLKNLLQFHINSPDINNEKAVLLAR
DETLGNCLNLTEIIPQASVRYDVNDQRLDIDVPQAWVMKNYQNYVDPSLWENGINAAMLS
NDQRLDIDVP
>YCJV_ECOLI P77481 HYPOTHETICAL ABC TRANSPORTER ...
MAQLSLQHIQKIYDNQVHVVKDFNLEIADKEFIVFVGPSGCGKSTTLRMIAGLEEISGGD
LLIDGKRMNDVPAKARNIAMVFQNYALYPHMTVYDNMAFGLKMQKIAKEVIDERVNWAAQ
KISVAELTGAEFMLYTTVGGTS

when I press CTRL-D I get the following error message:

Error: Unable to read sequence ''

How can I tell the program that what I paste or type on standard input is
two different sequences? Is there a separator character that does this?

Best regards
Graziano

From aralp001 at udcf.gla.ac.uk Fri Oct 25 11:04:22 2002
From: aralp001 at udcf.gla.ac.uk (Dr Adam Ralph)
Date: Fri, 25 Oct 2002 16:04:22 +0100 (BST)
Subject: multi-page graphical output
In-Reply-To: <3DA4613B.3010901@uv.es>
Message-ID:

Dear Anyone,

I am trying to write a program which outputs a graph, similar to plotcon or
cpgplot. It would appear that, the way these programs are constructed, the
graph is plotted on one page. Thus if you have a large sequence the graph
looks a bit of a mess.

Other types of graphical program (like prettyplot) which plot lines of text
are able to alter the number of characters per line and produce multiple
pages. My question is: can someone show me or give me an example program
which splits histogram/graph plots into multiple pages? Thus on one page you
can have a graph of residues 1-1000, then a graph of 1001-2000 etc.

Thanks in advance

Adam

Dr. Adam Ralph
Institute of Virology
University of Glasgow
Church Street
Glasgow G11 5JR
Phone: 0141 330 6268
Fax: 0141 337 2236
email: a.ralph at vir.gla.ac.uk

From ggaz at cpqrr.fiocruz.br Wed Oct 9 17:19:56 2002
From: ggaz at cpqrr.fiocruz.br (Prof.
Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:19:56 -0300
Subject: jemboss
Message-ID: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>

I would like to use the jemboss program but I need to enroll in HGMP and I don't know how I can do this.
Could you help me?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

From ggaz at cpqrr.fiocruz.br Wed Oct 9 17:15:32 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:15:32 -0300
Subject: jemboss
Message-ID: <000801c26fd9$9e235fe0$6500a8c0@cpqrr.fiocruz.br>

I would like to use jemboss (the Java interface for EMBOSS) but I need to enroll in HGMP and I don't know how I can do this.
Could you send me the email address where I can do this?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

From tcarver at hgmp.mrc.ac.uk Mon Oct 28 13:15:35 2002
From: tcarver at hgmp.mrc.ac.uk (Dr T. Carver)
Date: Mon, 28 Oct 2002 18:15:35 +0000 (GMT)
Subject: jemboss
In-Reply-To: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>
Message-ID:

Hi

You can register at the HGMP by filling out the form at:

http://www.hgmp.mrc.ac.uk/About/Registration/

Then send it to:

UK MRC HGMP Resource Centre
Hinxton
Cambridge
CB10 1SB
UK

You will then be sent an HGMP username and password.

Regards
Tim Carver

On Wed, 9 Oct 2002, Prof.
Giovanni Gazzinelli wrote:

> I would like to use the jemboss program but I need to enroll in HGMP and I don't know how I can do this.
> Could you help me?
> Thanks,
> Solange Busek
> Centro de Pesquisas René Rachou/FIOCRUZ
>

From David.Lapointe at umassmed.edu Mon Oct 28 17:21:55 2002
From: David.Lapointe at umassmed.edu (Lapointe, David)
Date: Mon, 28 Oct 2002 17:21:55 -0500
Subject: Emboss on Solaris.
Message-ID: <13B2F22F9D5DD611B07700508BB1E88F019A2D7A@edunivexch02.umassmed.edu>

We've moved to a Netra T1 and I am having problems with the PNG libraries.
I get these runtime errors using png as output (postscript/X11 work fine).
The png.h is 1.2.4. What am I missing?

$ prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: opsin.msf
Graph type [x11]: png
libpng warning: Application was compiled with png.h from libpng-1.0.6
libpng warning: Application is running with png.c from libpng-1.2.4
gd-png: fatal libpng error: Incompatible libpng version in application and library

David Lapointe
Senior Informaticist / Information Services
Assistant Professor / Cell Biology
UMass Worcester
(508) 856-5141

From David.Bauer at SCHERING.DE Tue Oct 29 01:37:00 2002
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 29 Oct 2002 07:37:00 +0100
Subject: Antwort: Emboss on Solaris.
Message-ID:

Hi,

I also had some problems with this on Solaris. Did you try to run configure
with "--with-pngdriver=DIR"? This helps EMBOSS to pick the right header
files.

David.

We've moved to a Netra T1 and I am having problems with the PNG libraries.
I get these runtime errors using png as output (postscript/X11 work fine).
The png.h is 1.2.4. What am I missing?
$ prettyplot Displays aligned sequences, with colouring and boxing Input sequence set: opsin.msf Graph type [x11]: png libpng warning: Application was compiled with png.h from libpng-1.0.6 libpng warning: Application is running with png.c from libpng-1.2.4 gd-png: fatal libpng error: Incompatible libpng version in application and library David Lapointe Senior Informaticist / Information Services Assistant Professor / Cell Biology UMass Worcester (508) 856-5141 From shibl at seqbio.com Wed Oct 30 11:13:08 2002 From: shibl at seqbio.com (Shibl Mourad) Date: Wed, 30 Oct 2002 11:13:08 -0500 Subject: Emboss Expert System Message-ID: <002c01c2802f$3fec6370$2602a8c0@SEQUENCE> Dear EMBOSS user, We are currently developing an expert system that will complement EMBOSS. As there are roughly 200 tools packaged within EMBOSS alone, the task to locate the 'right' tool, especially if you are newcomer to the bioinformatics field, can be overwhelming. Our expert system, openExpert, aims to simulate the 'question and answer' conversation one would have with a bioinformatics 'expert' - but minus their presence and wage. Although it is currently populated with only the EMBOSS suite, we aim to broaden the knowledge base of openExpert to encompass all known bioinformatics tools. We are looking for 5 EMBOSS users to review the system. The review should not take more than 30 minutes of your time and it would be of great value to us. If you are interested, please email shibl at seqbio.com. If you would like to try openExpert without providing a review, please indicate so in your email and we will provide with free access. Help us make openExpert a valuable expert system for bioinformatics. Thank you, Shibl Mourad, President Sequence Bioinformatics From newgene at bigfoot.com Thu Oct 31 12:43:06 2002 From: newgene at bigfoot.com (clwu) Date: Thu, 31 Oct 2002 11:43:06 -0600 Subject: emboss in cygwin Message-ID: <3DC16BAA.1050201@bigfoot.com> Hi, group, I am new to group. 
I tried to compile EMBOSS under win2K/cygwin but I failed. EMBOSS website at HGMP mentioned that "Richard Bruskiewich and Simon Kelley at the Sanger Centre have succeeded in compiling EMBOSS under Windows NT using the CygWin package. The resulting executables have been tested but not thoroughly enough for a release. Contact Richard Bruskiewich for more information. ". But I can not follow the link in this page to get help. Does anyone have the successful experience on this? Are there pre-complied executables for cygwin available, even part of those standalone programs? That will help me a lot. Thank you in advance. clwu
From avc at sanger.ac.uk Mon Oct 7 10:50:40 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Mon, 07 Oct 2002 11:50:40 +0100 Subject: fasta splitter Message-ID: <3DA16700.90280280@sanger.ac.uk> Is there an emboss app to split a large fasta file into a set of smaller ones? I'm combing the docs but can't see anything - it may be staring me in the face... thanks Tony -- ############################################################## Email: avc at sanger.ac.uk # Webmaster,The Sanger Centre, Tel: 01223 497512 # Hinxton, CAMBRIDGE CB10 1SA. Fax: 01223 494919 # http://www.sanger.ac.uk/ ############################################################## From Thomas.Laurent at uk.lionbioscience.com Mon Oct 7 11:02:02 2002 From: Thomas.Laurent at uk.lionbioscience.com (Thomas Laurent) Date: Mon, 07 Oct 2002 12:02:02 +0100 Subject: fasta splitter References: <3DA16700.90280280@sanger.ac.uk> Message-ID: <3DA169AA.1040409@uk.lionbioscience.com> Hi tony, I think Splitter should do the job : http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html Cheers, Thomas Tony Cox wrote: > Is there an emboss app to split a large fasta file into a set of smaller ones? > I'm combing the docs but can't see anything - it may be staring me in the > face... > > thanks > > Tony > From avc at sanger.ac.uk Mon Oct 7 11:16:47 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Mon, 7 Oct 2002 12:16:47 +0100 (BST) Subject: fasta splitter In-Reply-To: <3DA169AA.1040409@uk.lionbioscience.com> Message-ID: On Mon, 7 Oct 2002, Thomas Laurent wrote: +>Hi tony, +>I think Splitter should do the job : +>http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html almost, but not quite. This converts one file to many files containg one sequence.
I need something like a conversion of one file containing 1000 seqs to 10 files each containing 100 seqs Tony +> +>Cheers, +>Thomas +> +>Tony Cox wrote: +>> Is there an emboss app to split a large fasta file into a set of smaller ones? +>> I'm combing the docs but can't see anything - it may be staring me in the +>> face... +>> +>> thanks +>> +>> Tony +>> +> +> ****************************************************** Tony Cox Email:avc at sanger.ac.uk Sanger Institute WWW:www.sanger.ac.uk Wellcome Trust Genome Campus Webmaster Hinxton Tel: +44 1223 834244 Cambs. CB10 1SA Fax: +44 1223 494919 ****************************************************** From jweiner1 at ix.urz.uni-heidelberg.de Mon Oct 7 11:47:33 2002 From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3) Date: Mon, 7 Oct 2002 13:47:33 +0200 (METDST) Subject: fasta splitter In-Reply-To: Message-ID: Hello, > almost, but not quite. This converts one file to many files containg one > sequence. I need something like a conversion of one file containing 1000 > seqs to 10 files each containing 100 seqs I wrote you a simple perl script which should do the job. Save it to a file and make it executable (I think you are using a Unix-based system, aren't you?) with chmod a+x split.pl. To be on the safe side, put it in a new directory, and copy your sequence file to the same directory. Now run ./split.pl ...where filename is the name of the file containing your 1000+ sequences, and is the number of sequences you wish to have in each produced file. The produced file will have the same name as the original file with the appendix .1, .2, .3 etc. I tried the script and it seems to work fine. Meet the power of Perl :-) Regards, j. ----)-\//-///-----------------------------------January-Weiner-3------- Technologists often forget the general user. Technology is only as good as the user experience. That is something that technology groups very often forget... 
[ Linus Torvalds, taken from the GNOME Usability Project ] -------------- next part -------------- A non-text attachment was scrubbed... Name: split.pl Type: application/x-perl Size: 849 bytes Desc: URL: From areagp61 at yahoo.it Mon Oct 7 12:49:35 2002 From: areagp61 at yahoo.it (Graziano P.) Date: Mon, 7 Oct 2002 14:49:35 +0200 Subject: Codon usage files Message-ID: <000b01c26e00$03ee27f0$18105709@italy.ibm.com> Hi all, with backtranseq I can use different codon usage table selecting different "codon usage files" in the EMBOSS data path. Some files are self-explanating (for example Ehuman.cut is the codon usage file name for Homo sapiens), but other files are not so self-explanating like Eacc.cut, Esma.cut, Eddi.cut, etc. Is there any document that report informations about every file? Thanks Graziano Pappad? ______________________________________________________________________ Mio Yahoo!: personalizza Yahoo! come piace a te http://it.yahoo.com/mail_it/foot/?http://it.my.yahoo.com/ From md0nilhe at mdstud.chalmers.se Mon Oct 7 13:10:33 2002 From: md0nilhe at mdstud.chalmers.se (Henrik Nilsson) Date: Mon, 7 Oct 2002 15:10:33 +0200 (MET DST) Subject: EMBASSY problem Message-ID: Hello I'm having major problems with compiling the PHYLIP package of EMBASSY. Would anyone happen to have compiled it successfully on RedHat 7.3, and would be willing to send me the executables? hENRiK -- Written using VIM - Vi IMproved version 5.0 http://www.vim.org From ableasby at hgmp.mrc.ac.uk Mon Oct 7 13:14:43 2002 From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk) Date: Mon, 7 Oct 2002 14:14:43 +0100 (BST) Subject: Codon usage files Message-ID: <200210071314.OAA29103@bromine.hgmp.mrc.ac.uk> Not every file but most are described in the README file from ftp://ftp.ebi.ac.uk/pub/databases/codonusage You can use the EMBOSS program 'cutgextract' on the CUTG database to get files with more meaningful (long) names. 
Alan

From mathog at mendel.bio.caltech.edu Mon Oct 7 14:49:05 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Mon, 07 Oct 2002 07:49:05 -0700
Subject: fasta splitter
Message-ID:

>
> Is there an emboss app to split a large fasta file into a set of smaller ones?
> I'm combing the docs but can't see anything - it may be staring me in the
> face...

This isn't an EMBOSS entry, but it will probably do what you want:

ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

There are some other fasta related utilities in the same directory.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From tmargus at ebc.ee Mon Oct 7 17:54:12 2002
From: tmargus at ebc.ee (Tõnu Margus)
Date: Mon, 7 Oct 2002 20:54:12 +0300
Subject: WWW - Emma is not able to create SOME temporary files
Message-ID: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>

Hi,

I am using EMBOSS via Luke McCarthy's web interface. All other programs are
working, but emma did not work correctly.

It gives an error:

Error: failed to open filename 8808B
Problem writing out EMBOSS alignment file
Error: failed to open filename 8808B
Problem writing out EMBOSS alignment file

It seems that for some reason it cannot create a file under the runs/temp
directory. Why not is unclear to me. All other files are there.

Files under the directory runs/fileVxWbES$/:

root at kobra:fileVxWbES$ ls -l
total 16
-rw-r--r-- 1 www java  915 Oct 7 20:51 8825A
-rw-r--r-- 1 www java    0 Oct 7 20:51 dendoutfile
-rw-r--r-- 1 www java  384 Oct 7 20:51 error
-rw-r--r-- 1 www java 2145 Oct 7 20:51 index.html
drwxr-xr-x 2 www java 4096 Oct 7 20:51 input
-rw-r--r-- 1 www java    0 Oct 7 20:51 outseq

Command line clustalw works ok

Is there a solution for this problem?

Tõnu Margus

From starksb at ebi.ac.uk Mon Oct 7 18:58:15 2002
From: starksb at ebi.ac.uk (David Starks-Browning)
Date: Mon, 7 Oct 2002 19:58:15 +0100
Subject: WWW - Emma is not able to create SOME temporary files
In-Reply-To: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
References: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
Message-ID: <5473-Mon07Oct2002195815+0100-starksb@ebi.ac.uk>

On Monday 7 Oct 02, Tõnu Margus writes:
> Hi,
>
> I am using EMBOSS via Luke McCarthy's web interface. All other programs are working,
> but emma did not work correctly.
>
> It gives an error:
>
> Error: failed to open filename 8808B
> Problem writing out EMBOSS alignment file
> Error: failed to open filename 8808B
> Problem writing out EMBOSS alignment file
>
> It seems that for some reason it cannot create a file under the runs/temp directory.
> Why not is unclear to me. All other files are there.
>
> Files under the directory runs/fileVxWbES$/:
>
> root at kobra:fileVxWbES$ ls -l
> total 16
> -rw-r--r-- 1 www java  915 Oct 7 20:51 8825A
> -rw-r--r-- 1 www java    0 Oct 7 20:51 dendoutfile
> -rw-r--r-- 1 www java  384 Oct 7 20:51 error
> -rw-r--r-- 1 www java 2145 Oct 7 20:51 index.html
> drwxr-xr-x 2 www java 4096 Oct 7 20:51 input
> -rw-r--r-- 1 www java    0 Oct 7 20:51 outseq
>
> Command line clustalw works ok

You don't show the permissions of the directory itself (use 'ls -la'). It's
the directory permissions that determine whether files can be created.

However, this may not be the problem. We have seen problems with emma on
Linux, because the underlying application, clustalw, cannot deal with
filenames that are 5 characters long on Linux. String buffer management bugs
in emma cause it to emit garbage characters after the filename to the open()
system call. With emma, you will see this when emma's PID is 4 digits long.
(You won't see the garbage characters in error messages. You only see them
under strace.)

Clustalw should be fixed.
If that won't happen, emma.c could be modified to pad the temporary file name with enough extra characters so that, regardless of Linux PID, emma will use temp filenames longer than 5 characters. I don't have a patch for the latest version of emma, because I applied the workaround to an old (1.9.1) version of EMBOSS. Emma.c has changed a bit since then, although the change is still straightforward to apply. If you think this is your problem, I can provide details on how to modify emma.c. Hope this helps. Kind regards, David ------------------------------------------------------------------- David Starks-Browning | starksb at ebi.ac.uk EMBL Outstation -- | The European Bioinformatics Institute | Wellcome Trust Genome Campus | tel: +44 (1223) 494 616 Hinxton, Cambridge, CB10 1SD, UK | fax: +44 (1223) 494 468 ------------------------------------------------------------------- From tcarver at hgmp.mrc.ac.uk Tue Oct 8 08:44:34 2002 From: tcarver at hgmp.mrc.ac.uk (Tim Carver) Date: Tue, 08 Oct 2002 09:44:34 +0100 Subject: Jemboss Server Feedback Message-ID: <3DA29AF2.BE48E5F5@hgmp.mrc.ac.uk> It would be immensely useful if those who have setup a Jemboss server could provide some feedback to us. This is useful in providing some ideas for the future direction of its development and to give our funding body some idea of its usage at other sites. In particular the following information would be of use: 1. Nationality 2. Funding body and/or Organisation 3. Server Platform O/S (linux, solaris, MacOSX, AIX, HP-UX....) 4. Type of installation - e.g. with unix authorisation 5. Number of users at your site using Jemboss 6. 
Comments - what where you using before & why you changed - likes, dislikes & suggestions for Jemboss development (server & client) Many thanks in advance, Tim Carver HGMP-RC From mq1 at sanger.ac.uk Tue Oct 8 13:00:06 2002 From: mq1 at sanger.ac.uk (Mike Quail) Date: Tue, 8 Oct 2002 14:00:06 +0100 Subject: restriction mapping Message-ID: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk> Hi I am currently looking to isolate restriction fragments that cover gaps that are left in several genomes. To do this I need to cut the sequence we have of a genome with all known database enzymes and then select those that just cut a few times and in the right place so as to excise the region of the genome I require. GCG programs map and mapplot were excellent for doing this. Map in particular is good as it gives a graphical plot for each enzyme (one enzyme per line) plotting all the enzymes on a page or two so you can rapidly see which is appropriate. I have tried the EMBOSS programs and basically they are no use. REMAP does what I want but in too great detail (the output would stretch round the globe) and RESTRICT is too unordered in its output. I have got a program called oligo on my PC that will do this, BUT it has problems with big sequences. Recently I tried analysing a 1.5Mb chromosome and it would only work if I limited the number of enzymes to 6 or less. So I could transfer the data over to my PC and try with that but as this organism is 5Mb it will be very slow going. Have you any ideas of how this could be done in EMBOSS. M.Quail Project Leader Wellcome Trust Sanger Institute -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From peter.rice at uk.lionbioscience.com Tue Oct 8 13:30:31 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Tue, 08 Oct 2002 14:30:31 +0100 Subject: restriction mapping References: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk> Message-ID: <3DA2DDF7.70604@uk.lionbioscience.com> Mike Quail wrote: > I am currently looking to isolate restriction fragments that cover gaps > that are left in several genomes. To do this I need to cut the sequence > we have of a genome with all known database enzymes and then select > those that just cut a few times and in the right place so as to excise > the region of the genome I require. > > Have you any ideas of how this could be done in EMBOSS. You just need to know the enzymes that only cut twice, for example? % restrict -min 2 -max 2 -plasmid (the -plasmid may look odd, but it means "circular DNA" and says nothing about the size :-) You can also check each enzyme one at a time afterwards: % restrict -plasmid -fragment -enzyme BssHI ... the -fragment option includes the fragment sizes at the end of the report. You will need the positions and the fragment sizes to choose an enzyme. You can select other report formats (-rformat), but the default is probably the most useful for your case (-rformat EMBL or GFF, for example, will miss the -fragment output) Meanwhile, a graphical view could be nice so you can look for restriction sites on screen. We can look into that. Hope this helps, Peter Rice -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From jonas.andersson at rocketmail.com Tue Oct 8 14:35:50 2002 From: jonas.andersson at rocketmail.com (Jonas Andersson) Date: Tue, 8 Oct 2002 07:35:50 -0700 (PDT) Subject: Not compiling? Message-ID: <20021008143550.40367.qmail@web40110.mail.yahoo.com> When I try to compile the latest EMBOSS this is what I get. 
What do I do wrong, given that I do as is suggested on the EMBOSS pages? -MT ajreport.lo -MD -MP -MF .deps/ajreport.TPlo -o ajreport.o >/dev/null 2>&1 make[1]: *** [ajreport.lo] Error 1 make[1]: Leaving directory `/home/henrik/temp/emboss/EMBOSS-2.5.1/ajax' make: *** [all-recursive] Error 1 / Jonas __________________________________________________ Do you Yahoo!? Faith Hill - Exclusive Performances, Videos & More http://faith.yahoo.com From avc at sanger.ac.uk Tue Oct 8 15:10:22 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Tue, 8 Oct 2002 16:10:22 +0100 (BST) Subject: fasta splitter In-Reply-To: Message-ID: On Tue, 8 Oct 2002, January Weiner 3 wrote: Thanks to all that responded. I did, in the end write a 12 line bioperl script to split my fasta file. My request seems, however, to highlight a small blind spot on the EMBOSS radar. It appears that there are a number of implementations out there - perhaps one of them can be donated to the emboss project as the basis of a new software tool? Tony +>Hi, +> +>> This is apparently something that is frequently asked by biologists. +>> If you call it fastasplitter, I have a Web interface ready for it: +>> http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html +>> If you think it's interesting, I install it, and in such case, I will +>> put your name (J. Weiner ?) on the Web interface. +> +>No problem, do it, it's freeware (not even GPL :-). However, if you think +>that such a tool is useful, then I'll rewrite it in C -- to make it faster. +>If I may suggest -- it'd be nice if you could download or get the produced +>files as a tgz or zip archive. +> +>j. +> +>----)-\//-///-----------------------------------January-Weiner-3------- +>Wysz?a Ho?? i Czyst?, wr?ci?a Wsp?ln? i Nieca?? [ (C) by moja babcia ] +> +> ****************************************************** Tony Cox Email:avc at sanger.ac.uk Sanger Institute WWW:www.sanger.ac.uk Wellcome Trust Genome Campus Webmaster Hinxton Tel: +44 1223 834244 Cambs. 
CB10 1SA Fax: +44 1223 494919 ****************************************************** From Joerg.Schaber at uv.es Tue Oct 8 15:58:11 2002 From: Joerg.Schaber at uv.es (Joerg Schaber) Date: Tue, 08 Oct 2002 17:58:11 +0200 Subject: loading DDBJ data into EMBOSS Message-ID: <3DA30093.6080404@uv.es> Hi, i have problems creating an EMBOSS database from a DDBJ flatfile (e.g. ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using 'dbiflat -idformat gb'. I get a warning for all entries in the flatfile 'Warning: Duplicate ID skipped: '' All hits will point to first ID found? and I can not retrieve any sequence. I think dbiflat only recognizes the first entry. When I download the corresponding fasta flatfile I have no problems creating an EMBOSS database using 'dbifasta'. However, I would like to use the original DDBJ flatfile because it includes more information. Any idea what's the problem? greetings, joerg From peter.rice at uk.lionbioscience.com Tue Oct 8 16:08:47 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Tue, 08 Oct 2002 17:08:47 +0100 Subject: loading DDBJ data into EMBOSS References: <3DA30093.6080404@uv.es> Message-ID: <3DA3030F.2030808@uk.lionbioscience.com> Joerg Schaber wrote: > Hi, > > i have problems creating an EMBOSS database from a DDBJ flatfile (e.g. > ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using > 'dbiflat -idformat gb'. I get a warning for all entries in the flatfile > 'Warning: Duplicate ID skipped: '' All hits will point to first ID > found? and I can not retrieve any sequence. I think dbiflat only > recognizes the first entry. > When I download the corresponding fasta flatfile I have no problems > creating an EMBOSS database using 'dbifasta'. However, I would like to > use the original DDBJ flatfile because it includes more information. > Any idea what's the problem? Yes ... that file is not in Genbank or DDBJ format!!!! It looks more like a CODATA format, but only the ENTRY is recognized. 
If you can find a name for it, we could probably implement a new
input/output sequence format ... but it has some horrible features that
will not be general.

Example entry:

ENTRY       BU002             CDS       Buchnera
NAME        atpB
DEFINITION  ATP synthase A chain [EC:3.6.3.14] [SP:ATP6_BUCAI]
CLASS       Metabolism; Energy Metabolism; Oxidative phosphorylation [PATH:buc00190]
            Metabolism; Energy Metabolism; ATP synthesis [PATH:buc00193]
            Metabolism; Energy Metabolism; Photosynthesis [PATH:buc00195]
POSITION    2278..3102
DBLINKS     RIKEN: BU002
            NCBI: 10038695
CODON_USAGE        T             C             A             G
          T   27  2 22  7   11  0  7  1    7  1  1  0    1  0  0  5
          C    4  0  3  2    6  1  4  2    5  1  8  2    1  0  2  0
          A   28  0  5 12    5  0  3  0    7  3 13  1    4  1  0  0
          G    4  1 12  3    5  1  5  0    8  0  7  1    7  2  4  0
AASEQ       274
            MILEKISDPQKYISHHLSHLQIDLRSFKIIQPGALSSDYWTVNVDSMFFSLVLGSFFLSI
            FYMVGKKITQGIPGKLQTAIELIFEFVNLNVKSMYQGKNALIAPLSLTVFIWVFLMNLMD
            LVPIDFFPFISEKVFELPAMRIVPSADINITLSMSLGVFFLILFYTVKIKGYVGFLKELI
            LQPFNHPVFSIFNFILEFVSLVSKPISLGLRLFGNMYAGEMIFILIAGLLPWWTQCFLNV
            PWAIFHILIISLQAFIFMVLTIVYLSMASQSHKD
NTSEQ       825
            atgattttagaaaagatatctgatcctcaaaaatatattagtcatcatttaagtcacttg
            cagatagatttgcgttcttttaaaattattcaaccaggtgcattgtcttctgattattgg
            actgtaaatgttgattcaatgtttttttctcttgtactgggtagtttttttttaagtatt
            ttttatatggtaggaaaaaaaattactcaaggtataccaggtaaattacaaactgcaatt
            gagttaatttttgaatttgtaaatttaaatgtaaaaagcatgtatcaaggtaaaaatgct
            cttattgcacctttatcattaacagtatttatttgggtttttttaatgaatctaatggat
            ttagttccgattgatttctttccatttatttctgaaaaagtgtttgaattacctgctatg
            cgaattgtaccttctgctgatattaatattacactatcaatgtcacttggcgtgtttttt
            ttaattttattttatactgttaaaattaaaggatatgtaggctttttaaaagaacttatt
            ttacaacctttcaaccatcctgtattttctatttttaattttatattagaatttgtgtca
            ttggtctcgaaacccatttctttgggattgcgattatttggaaacatgtacgcaggtgaa
            atgatttttattttaattgcaggtttgctgccatggtggacacaatgttttttaaacgta
            ccgtgggctatttttcatattttaataatttcactacaggcttttatttttatggtatta
            actattgtatatttatcaatggcctctcaatctcataaagattaa
///

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From
peter.rice at uk.lionbioscience.com Tue Oct 8 16:37:36 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Tue, 08 Oct 2002 17:37:36 +0100 Subject: fasta splitter References: Message-ID: <3DA309D0.7@uk.lionbioscience.com> Tony Cox wrote: > On Tue, 8 Oct 2002, January Weiner 3 wrote: > > Thanks to all that responded. I did, in the end write a 12 line bioperl script > to split my fasta file. My request seems, however, to highlight a small blind > spot on the EMBOSS radar. It appears that there are a number of implementations > out there - perhaps one of them can be donated to the emboss project as the > basis of a new software tool? Nobody suggested hacking "seqret" to do what you want... One problem doing this in EMBOSS is the need to generate filenames for your split files - but maybe a base filename would be enough to generate names. Then all you need to do is count sequences in a modified seqret.c and change the output file. You can add a command line option for the number of sequences in an output file. Cleaning up output files for a rerun is an exercise for the user (unless you want to invent a new ACD type that does it :-) Needs a modified version of the seqFileReopen function to handle the file naming, but nothing complicated is involved. regards. Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From avc at sanger.ac.uk Tue Oct 8 16:39:31 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Tue, 8 Oct 2002 17:39:31 +0100 (BST) Subject: fasta splitter In-Reply-To: <3DA309D0.7@uk.lionbioscience.com> Message-ID: On Tue, 8 Oct 2002, Peter Rice wrote: that sounds excellent - does this mean it really will make it in to the EMBOSS release? (any idea when? ;) Tony +>Tony Cox wrote: +>> On Tue, 8 Oct 2002, January Weiner 3 wrote: +>> +>> Thanks to all that responded. I did, in the end write a 12 line bioperl script +>> to split my fasta file. 
My request seems, however, to highlight a small blind +>> spot on the EMBOSS radar. It appears that there are a number of implementations +>> out there - perhaps one of them can be donated to the emboss project as the +>> basis of a new software tool? +> +>Nobody suggested hacking "seqret" to do what you want... +> +>One problem doing this in EMBOSS is the need to generate filenames for your +> split files - but maybe a base filename would be enough to generate +>names. Then all you need to do is count sequences in a modified seqret.c +>and change the output file. You can add a command line option for the +>number of sequences in an output file. Cleaning up output files for a rerun +>is an exercise for the user (unless you want to invent a new ACD type that +>does it :-) +> +>Needs a modified version of the seqFileReopen function to handle the file +>naming, but nothing complicated is involved. +> +>regards. +> +>Peter +> +>-- +>------------------------------------------------ +>Peter Rice, LION Bioscience Ltd, Cambridge, UK +>peter.rice at uk.lionbioscience.com +44 1223 224723 +> ****************************************************** Tony Cox Email:avc at sanger.ac.uk Sanger Institute WWW:www.sanger.ac.uk Wellcome Trust Genome Campus Webmaster Hinxton Tel: +44 1223 834244 Cambs. CB10 1SA Fax: +44 1223 494919 ****************************************************** From peter.rice at uk.lionbioscience.com Tue Oct 8 16:42:39 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Tue, 08 Oct 2002 17:42:39 +0100 Subject: fasta splitter References: Message-ID: <3DA30AFF.3010100@uk.lionbioscience.com> Hi Tony > that sounds excellent - does this mean it really will make it in to the EMBOSS > release? (any idea when? ;) I already have the first part of the code ... a modified "seqret" to split into 10 sequences per file. Working copy is called "tenco" :-) What did you have in mind as a naming convention for the output files? 
The existing code names each file after the first sequence, I guess you want "outfile.1" "outfile.2" and so on, possibly with leading zeroes "outfile.,001" etc. Peter -- ------------------------------------------------ Peter Rice, LION Bioscience Ltd, Cambridge, UK peter.rice at uk.lionbioscience.com +44 1223 224723 From letondal at pasteur.fr Tue Oct 8 18:11:11 2002 From: letondal at pasteur.fr (Catherine Letondal) Date: Tue, 08 Oct 2002 20:11:11 +0200 Subject: fasta splitter In-Reply-To: Your message of "Mon, 07 Oct 2002 13:47:33 +0200." Message-ID: <200210081811.g98IBBuY253618@electre.pasteur.fr> January Weiner 3 wrote: > > Hello, > > > almost, but not quite. This converts one file to many files containg one > > sequence. I need something like a conversion of one file containing 1000 > > seqs to 10 files each containing 100 seqs > > I wrote you a simple perl script which should do the job. Save it to a > file and make it executable (I think you are using a Unix-based system, > aren't you?) with chmod a+x split.pl. To be on the safe side, put it in a > new directory, and copy your sequence file to the same directory. Now run > > ./split.pl > > ...where filename is the name of the file containing your 1000+ sequences, > and is the number of sequences you wish to have in > each produced file. The produced file will have the same name as the > original file with the appendix .1, .2, .3 etc. > > I tried the script and it seems to work fine. 
Meet the power of Perl :-) For information, I have installed this script and the program is available at: http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html -- Catherine Letondal -- Pasteur Institute Computing Center From avc at sanger.ac.uk Tue Oct 8 18:38:50 2002 From: avc at sanger.ac.uk (Tony Cox) Date: Tue, 8 Oct 2002 19:38:50 +0100 Subject: fasta splitter References: <3DA30AFF.3010100@uk.lionbioscience.com> Message-ID: <000d01c26ef9$f206f710$0a00a8c0@zeus> ----- Original Message ----- From: "Peter Rice" To: "Tony Cox" Cc: "January Weiner 3" ; ; Sent: Tuesday, October 08, 2002 5:42 PM Subject: Re: fasta splitter > Hi Tony > > > that sounds excellent - does this mean it really will make it in to the EMBOSS > > release? (any idea when? ;) > > I already have the first part of the code ... a modified "seqret" to split > into 10 sequences per file. > > Working copy is called "tenco" :-) > > What did you have in mind as a naming convention for the output files? The > existing code names each file after the first sequence, I guess you want > "outfile.1" "outfile.2" and so on, possibly with leading zeroes > "outfile.,001" etc. Hi Peter, This sounds great to me. Personally, I'd prefer not to have the leading zeros - just an incrementing ".[integer]" appended to the filename supplied. Makes shell manipulation easier. I guess the ideal would able to supply either a number of chunks to split the file in to or else specify a maximum size (either in bytes or fasta entries) for each chunk. cheers Tony From mathog at mendel.bio.caltech.edu Tue Oct 8 19:00:12 2002 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Tue, 08 Oct 2002 12:00:12 -0700 Subject: fasta splitter Message-ID: There's more than one way to split a fasta file... 1. Split M entries into N files, file 1 receives 1->M/N, file 2 receives M/N+1->2M/N, etc. Advantages - only one file needs to be open at a time, simple. Disadvantage - the resulting split is typically uneven. 
Do this with the NCBI databases and you'll find that they are heavily
weighted for smaller sequences at the beginning and longer ones at the
end. If the point of the split is to load balance (this is what I use it
for, with parallel BLAST) some nodes will finish much earlier than others.

Implementation: (deleted, I found this method not to be generally useful)

1b. head/tail/segment entries out of a fasta file.

While (1) caused a lot of problems I've often needed to chop out a
specific part of a fasta file. Why? Because some piece of software was
blowing up on the 351,234th entry, but only if preceded by several
thousand other entries. Finding the smallest piece that will trigger the
bug can save hours of run time debugging these sorts of problems.

Implementation: ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c

2. Split M entries into N files, cycling output to each file. That is,
entry i goes to file i modulo N.

Advantage - resulting files tend to be more even in size.

Disadvantage - N output files must be open at once (or you have to cycle
through N times, once per phase); if M is small and the size of each
entry large the resulting files will not generally be balanced. Example:
splitting the yeast genome. Heaven help us when full length human
chromosomes start showing up as single FASTA file entries.

Implementation: ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

3. Split P bases in M entries into N files "evenly", fragmenting
sequences if they are too large.

Advantage: fixes the genome data problem from (2).

Disadvantages: even more complex than (2) and "entries" in resulting
files do not correspond one to one with the original. Even with clever
naming conventions (yeastII_100001_200000) end users will be confused.
Clever names will be truncated by most software at the worst possible
place resulting in a "hit" on "yeastII_" :-(.
Implementation: (well, partially, this one translates in all 6 frames,
but it has some of the naming/fragmenting features):
ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c

4. Split by content. I.e., strip all the human sequences out of nr. I
don't believe there is a general solution because there is no
universally agreed upon FASTA header line format.

Implementation: SRS or something similar.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From squiresb at macrogenics.com Tue Oct 8 19:20:35 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Tue, 08 Oct 2002 14:20:35 -0500
Subject: eprimer3...broken pipe?
In-Reply-To:
Message-ID:

I have tried to install various versions of emboss and when I try to run
eprimer3 I get the following message:

[loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
Picks PCR primers and hybridization oligos
Input sequence(s): /bioinfo/fragments.fa
Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out

EMBOSS An error in eprimer3.c at line 317:
The program 'primer3_core' must be on the path.
It is part of the 'primer3' package,
available from the Whitehead Institute.
See: http://www-genome.wi.mit.edu/
Broken pipe

Does anybody know how to fix this?

Thanks!

Burke Squires

--
Burke Squires
Bioinformatics
MacroGenics, Inc.
2600 Stemmons Freeway, Suite 210
Dallas, TX 75235 USA
Work: 214-634-3000 X224
Squiresb @ macrogenics.com (Please remove spaces to use)
www.macrogenics.com
----------------------------------------------------------------------------
This e-mail and any attachments may be confidential or legally privileged.
If you received this message in error or are not the intended recipient, you
should destroy the e-mail message and any attachments or copies, and you are
prohibited from retaining, distributing, disclosing or using any information
contained herein. Please inform us of the erroneous delivery by return
e-mail.

Thank you for your cooperation.
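
[Archive note] The error above says that 'primer3_core' must be on the
path. As a rough sketch only (Python standard library; the helper name
find_tool is hypothetical and is not part of EMBOSS or primer3), one way
to check whether the executable is visible before running eprimer3 might
look like this:

```python
import os
import shutil


def find_tool(name, path=None):
    """Return the full path of an executable found on the search path, or None.

    `path` is an optional os.pathsep-separated string; when omitted,
    the PATH environment variable is searched.
    """
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_tool("primer3_core")
    if location is None:
        # Show the directories that were searched, to help spot a missing entry.
        print("primer3_core not found; directories on PATH:")
        for entry in os.environ.get("PATH", "").split(os.pathsep):
            print("  " + entry)
    else:
        print("primer3_core found at " + location)
```

If the program is reported as missing, adding the directory that holds
the primer3_core binary to PATH (or copying the binary into a directory
already listed) is the usual remedy; where primer3 was installed is
site-specific and is not stated in this thread.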
From tchiang at bioinfo.sickkids.on.ca Tue Oct 8 19:31:18 2002
From: tchiang at bioinfo.sickkids.on.ca (Ted Chiang)
Date: Tue, 8 Oct 2002 15:31:18 -0400 (EDT)
Subject: cusp
Message-ID:

Hi,

I have a question about the EMBOSS program cusp. The program creates a
codon usage table based on the "coding" sequence of the input file.

My question is how it determines where the 'coding' (or ORF) sequence
is, given any DNA sequence, when one executes the program without
specifying the -sbeg and -send flags, i.e.

$cusp dna_seq

How does cusp determine where the coding sequence begins? As opposed to

$cusp dna_seq -sbegin 135 -send 192

where the coding sequence is specified. In the latter case, if the
length of the specified region is not divisible by 3, does cusp ignore
the last few nucleotides?

Thanks.

-Ted

=====================================
Ted Chiang, Analyst
Centre for Computational Biology
Hospital for Sick Children, Toronto
416.813.7028
tchiang at bioinfo.sickkids.on.ca
=====================================

From sebastian.bassi at ar.advantaseeds.com Tue Oct 8 20:05:43 2002
From: sebastian.bassi at ar.advantaseeds.com (Sebastian Bassi)
Date: Tue, 8 Oct 2002 22:05:43 +0200
Subject: fasta splitter
Message-ID:

> What did you have in mind as a naming convention for the
> output files? The
> existing code names each file after the first sequence, I
> guess you want
> "outfile.1" "outfile.2" and so on, possibly with leading zeroes
> "outfile.,001" etc.

My $.02: I think that outfile.[number_here] is not a good convention,
since the extension (whatever you put after the dot) means the file
type, and here the file type is always the same (ASCII text). I think it
should be something like:
outfile_[number].txt
It should look like this:
outfile_1.txt
outfile_2.txt
outfile_3.txt

Anyway, IANAP (I am not a programmer). I'm just an end user and I'm
stating this from a user-consistency point of view.
If I have two mp3 files (xsongpart1 and xsongpart2) I would name them
part1.mp3 and part2.mp3 and NOT xsong.1 and xsong.2

From mathog at mendel.bio.caltech.edu Tue Oct 8 20:33:56 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 08 Oct 2002 13:33:56 -0700
Subject: fasta splitter
Message-ID:

> > What did you have in mind as a naming convention for the
> > output files? The
> > existing code names each file after the first sequence, I
> > guess you want
> > "outfile.1" "outfile.2" and so on, possibly with leading zeroes
> > "outfile.,001" etc.
>
> My $.02: I think that outfile.[number_here] is not a good convention, since the extension (whatever you put after the dot) means the file type, and here the file type is always the same (ASCII text). I think it should be something like:
> outfile_[number].txt
> It should look like this:
> outfile_1.txt

I agree. Also the numeric range should be displayed in a fixed column
width. Ideally something like:

% esplit \
    -sequence=ncbi_nr.nfa \
    -fmask='nr_frag_####.nfa' \
    -splitn=20 \
    -splitmode=cycle \
    -numberfrom=0

would produce

nr_frag_0000.nfa
...
nr_frag_0019.nfa

Keeping the names fixed width prevents all sorts of text alignment
problems which can show up otherwise.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From fernan at iib.unsam.edu.ar Tue Oct 8 22:51:16 2002
From: fernan at iib.unsam.edu.ar (Fernan Aguero)
Date: Tue, 8 Oct 2002 19:51:16 -0300
Subject: fasta splitter
In-Reply-To: <3DA309D0.7@uk.lionbioscience.com>
References: <3DA309D0.7@uk.lionbioscience.com>
Message-ID: <20021008225116.GA273@iib.unsam.edu.ar>

+----[ Thus spoke Peter Rice (peter.rice at uk.lionbioscience.com):
| [ snipped ]
|
| One problem doing this in EMBOSS is the need to generate filenames for your
| split files - but maybe a base filename would be enough to generate
| names.

Now let me get myself into the discussion.
The splitter I use is called 'shatter' and is part of the SEALS package, which I guess is unmaintained (and perhaps obsolete?) and is basically perl. ftp://ftp.ncbi.nih.gov/pub/walker/seals/software The following discussion works for splitting into individual sequences, but not into groups of sequences. In this case a different naming scheme should be used, (though perhaps the same argument specifier '-word' could be used?). The approach of shatter (both for splitting FASTA files, but also for splitting concatenated BLAST reports, which are splitted by 'shatterblast') is to let you choose the 'word' which will be used as a basename. Both shatters know about the NCBI FASTA standard and thus, given a FASTA header like the following: >gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc will take the gi as word 2 (123456), the accession number (AA123456) as word 4, the accession.version (AA123456.1) as word 5 and so on. In the command-line you just say 'shatter -word 1 fastafile' if you want the first word after the '>' to be the basename. This produces files with that basename and terminated in .fa The program will consider whitespace and the character '|' as word delimiters. In my own experience this is a good thing. I've used shatter with many different FASTA flavours and adjusting the word to be used as basename is plain easy. BLAST reports are also trivial since query sequences, are also usually in FASTA format, and you get basically the same header, though after the 'Query=' magic word. In this case you get files with the same basename, but ending in .br Just my 2 cents. Hope this makes it into EMBOSS. Fernan | and change the output file. You can add a command line option for the | number of sequences in an output file. 
Cleaning up output files for a rerun | is an exercise for the user (unless you want to invent a new ACD type that | does it :-) | | Needs a modified version of the seqFileReopen function to handle the file | naming, but nothing complicated is involved. | | regards. | | Peter | | -- | ------------------------------------------------ | Peter Rice, LION Bioscience Ltd, Cambridge, UK | peter.rice at uk.lionbioscience.com +44 1223 224723 | | | +----] -- F e r n a n A g u e r o http://genoma.unsam.edu.ar/~fernan From gwilliam at hgmp.mrc.ac.uk Wed Oct 9 08:28:08 2002 From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522) Date: Wed, 09 Oct 2002 09:28:08 +0100 Subject: eprimer3...broken pipe? References: Message-ID: <3DA3E898.C9318CEE@hgmp.mrc.ac.uk> >From the eprimer3 documentation: The Whitehead program must be set up and on the path in order for eprimer3 to find and run it. The Whitehead Institute program that is run by this program is available from: http://www-genome.wi.mit.edu/genome_software/other/primer3.html (Then see the link 'Get release 0.9') The version that is run by this program is 3.0.9 currently available from: http://www-genome.wi.mit.edu/ftp/distribution/software/primer3_0_9_test.tar.gz Gary Burke Squires wrote: > > I have tried to install various version of emboss and when I try and run > eprimer3 I get the following message: > > [loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3 > Picks PCR primers and hybridization oligos > Input sequence(s): /bioinfo/fragments.fa > Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out > > EMBOSS An error in eprimer3.c at line 317: > The program 'primer3_core' must be on the path. > It is part of the 'primer3' package, > available from the Whitehead Institute. > See: http://www-genome.wi.mit.edu/ > Broken pipe > > Does anybody know how to fix this? > > Thanks! > > Burke Squires > > -- > Burke Squires > Bioinformatics > MacroGenics, Inc. 
> 2600 Stemmons Freeway, Suite 210 > Dallas, TX 75235 USA > Work: 214-634-3000 X224 > Squiresb @ macrogenics.com (Please remove spaces to use) > www.macrogenics.com > ---------------------------------------------------------------------------- > This e-mail and any attachments may be confidential or legally privileged. > If you received this message in error or are not the intended recipient, you > should destroy the e-mail message and any attachments or copies, and you are > prohibited from retaining, distributing, disclosing or using any information > contained herein. Please inform us of the erroneous delivery by return > e-mail. > > Thank you for your cooperation. -- Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512 mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/ Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK From Joerg.Schaber at uv.es Wed Oct 9 17:02:51 2002 From: Joerg.Schaber at uv.es (Joerg Schaber) Date: Wed, 09 Oct 2002 19:02:51 +0200 Subject: swissprot Message-ID: <3DA4613B.3010901@uv.es> Hi, can't load the SWISSPROT- bacteria database (ftp://ftp.ebi.ac.uk/pub/databases/swissprot/special_selections/bacteria.seq) into EMBOSS. I think EMBOSS is running well because I have no problem accessing the test-databases (see showdb below). However, I think somehow seqret is using the wrong division file but the PATH-setting seem to be correct. greetings, joerg > dbiflat Index a flat file database EMBL : EMBL SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew GB : Genbank, DDBJ REFSEQ : Refseq Entry format [SWISS]: SWISS Database directory [.]: Wildcard database filename [*.dat]: *.seq Database name: swissbac Release number [0.0]: 1.0 Index date [00/00/00]: 09/10/02 > ll insgesamt 132100 950883 drwxrwxr-x 2 root users 4096 Okt 9 18:50 . 623533 drwxrwxr-x 5 jos jos 4096 Okt 9 17:50 .. 
950889 -rw-r--r--    1 jos      jos        189028 Okt  9 18:50 acnum.hit
950888 -rw-r--r--    1 jos      jos        660456 Okt  9 18:50 acnum.trg
623548 -rw-r--r--    1 jos      jos     133412511 Okt  9 18:25 bacteria.seq
950886 -rw-r--r--    1 jos      jos           322 Okt  9 18:50 division.lkp
950887 -rw-r--r--    1 jos      jos        836840 Okt  9 18:50 entrynam.idx

> showdb
Displays information on the currently available databases
# Name      Type ID  Qry All Comment
# ====      ==== ==  === === =======
swissbac    P    OK  OK  OK  SWISSPROT sequences of procaryotes 9/10/02
tpir        P    OK  OK  OK  PIR using NBRF access for 4 files
tsw         P    OK  OK  OK  Swissprot native format with EMBL CD-ROM index
tswnew      P    OK  OK  OK  Swissnew as 3 files in native format with EMBL CD-ROM index
twp         P    OK  OK  OK  EMBL new in native format with EMBL CD-ROM index
buch        N    OK  OK  OK  Buchnera database in DDBJ Format
fbuch       N    OK  OK  OK  Buchnera database in FASTA Format
tembl       N    OK  OK  OK  EMBL in native format with EMBL CD-ROM index
tgb         N    OK  -   -   Genbank IDs
tgenbank    N    OK  OK  OK  GenBank in native format with EMBL CD-ROM index

> head bacteria.seq
ID   120K_RICRI     STANDARD;      PRT;  1300 AA.
AC   P14914;
--snipp
--snipp

> seqret swissbac:120K_RICRI
Reads and writes (returns) sequences
Warning: Cannot open division file '' for database 'swissbac'
Warning: seqCdQry failed
Error: Unable to read sequence 'swissbac:120K_RICRI'
>

--
----------------------------------------------------------
Joerg Schaber
Instituto Cavanilles de Biodiversidad y Genetica Evolutiva
Universidad de Valencia        Tel.: ++34 96 398 3647
A.C. 22085                     Fax.: ++34 96 398 3670
46071 Valencia, España         email : jos at uv.es

From jweiner1 at ix.urz.uni-heidelberg.de Thu Oct 10 09:17:51 2002
From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3)
Date: Thu, 10 Oct 2002 11:17:51 +0200 (METDST)
Subject: fasta splitter
In-Reply-To: <000d01c26ef9$f206f710$0a00a8c0@zeus>
Message-ID:

> This sounds great to me. Personally, I'd prefer not to have the leading
> zeros - just an incrementing ".[integer]" appended to the filename supplied.
> Makes shell manipulation easier.
Well, I'd prefer the leading zeros -- because they make shell
manipulation easier :-) If you stay with the leading 0's, then any
listing will show the files in the correct order, otherwise it will show
"foo.1, foo.10, ..., foo.100,... foo.2, ..." etc.

j.

----)-\//-///-----------------------------------January-Weiner-3-------
"'Tis true, there's magic in the web of it." -- Shakespeare

From kenneth at geisshirt.dk Mon Oct 14 11:02:15 2002
From: kenneth at geisshirt.dk (Kenneth Geisshirt)
Date: Mon, 14 Oct 2002 13:02:15 +0200 (CEST)
Subject: Splitting genbank
Message-ID:

Hi everyone

I recently joined the mailing list (after a couple of weeks' usage of
EMBOSS) so I hope that my question isn't a FAQ.

I have a local copy of genbank, and I wish to split it into four
databases: one for humans, one for rats, one for mice and one for the
rest. The applications seqret and seqretsplit can help me with the first
three by specifying the organism in the USA, but how do I specify "not
human and not rat and not mouse"?

Thanks in advance

Kneth

--
Kenneth Geisshirt, M.Sc., Ph.D.     http://kenneth.geisshirt.dk
Grøndals Parkvej 2A, 3. sal         kenneth at geisshirt.dk
DK-2720 Vanløse                     +45 38 87 78 38

From peter.rice at uk.lionbioscience.com Mon Oct 14 11:27:34 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 12:27:34 +0100
Subject: Splitting genbank
References:
Message-ID: <3DAAAA26.1080707@uk.lionbioscience.com>

Kenneth Geisshirt wrote:
> I have a local copy of genbank, and I wish to split it into four
> databases: one for humans, one of rats, one of mouses and one for the
> rest. The applications seqret and seqretsplit can help me with the first
> three by specifying the organism in the usa, but how do I specify "not
> human and not rat and not mouse"?

In EMBOSS ....
split the gbrod file into rat, mouse and other rodents (a simple perl
script would do)

index and define GenBank

then define subsets using the same index files and exclude the ones you
don't want using, for example:

exclude: "*pri* *rat* *mus*"

... in copies of your EMBOSS database definition for genbank.

EMBOSS simply checks the excluded files list when using the index files.

regards,

Peter Rice

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From Joerg.Schaber at uv.es Mon Oct 14 12:11:59 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Mon, 14 Oct 2002 14:11:59 +0200
Subject: other indices
Message-ID: <3DAAB48F.6080704@uv.es>

Hi,

dbiflat can index fields other than the ID and accession number, such as
sequence version (seqv), description (des), keywords and taxon. However,
in the example databases that come with EMBOSS I found only field
definitions like 'fields: "sv des org key"'. So do I access the
additional indices (e.g. in seqret) via 'seqret-sv:\*',
'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively?
'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*' did not work.
Greetings, joerg From gwilliam at hgmp.mrc.ac.uk Mon Oct 14 12:18:20 2002 From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522) Date: Mon, 14 Oct 2002 13:18:20 +0100 Subject: other indices References: <3DAAB48F.6080704@uv.es> Message-ID: <3DAAB60C.ACF4111E@hgmp.mrc.ac.uk> See: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/UniformSequenceAddress.html#keys You append the 'sv', 'des', 'org', 'key', etc to the database name with a '-' and to a file name with a ':', so: with a database you use a command like: seqret embl-des:fau with a file you use a command like: seqret filename:org:homo Gary Joerg Schaber wrote: > > Hi, > > dbiflat allows to index other fields except id and accession number like > sequence version (seqv), description (des), keywords and taxon. However, > in the example databases that come with EMBOSS I found only field > definitions like 'fields: "sv des org key"'. So do I access the > additional indices (e.g. in seqret) via 'seqret-sv:\*', > 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively? > 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*' did not work. > > Greetings, > > joerg -- Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512 mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/ Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK From peter.rice at uk.lionbioscience.com Mon Oct 14 12:20:40 2002 From: peter.rice at uk.lionbioscience.com (Peter Rice) Date: Mon, 14 Oct 2002 13:20:40 +0100 Subject: other indices References: <3DAAB48F.6080704@uv.es> Message-ID: <3DAAB698.1080108@uk.lionbioscience.com> Joerg Schaber wrote: > dbiflat allows to index other fields except id and accession number like > sequence version (seqv), description (des), keywords and taxon. However, > in the example databases that come with EMBOSS I found only field > definitions like 'fields: "sv des org key"'. So do I access the > additional indices (e.g. 
in seqret) via 'seqret-sv:\*',
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively?
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*' did not work.

For a database called schaber

dbiflat -fields "acnum,seqvn,des,keyword,taxon"

In the emboss.default definition:

DB schaber [
   type: P
   format: swiss
   method: emblcd
   dir: /data/schaber
   indexdir: /data/schaber
   comment: "Flatfiles database, all fields indexed"
   fields: "sv des org key"
]

In EMBOSS programs, use the USA:

'schaber-sv:\*'
'schaber-des:\*'
'schaber-org:\*'
'schaber-key:\*'

The confusion comes because the database definition (and the USA syntax)
uses the field names in common use (e.g. in SRS) and dbiflat uses the
EMBLCD/Staden index file names that dbiflat will be writing.

regards,

Peter

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From jmuehlis at uni-muenster.de Tue Oct 15 09:03:19 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:03:19 +0200
Subject: this format is not readable by seqret
Message-ID: <3DABD9D7.4101AA0D@uni-muenster.de>

Hello there,

my name is Jörg Mühlisch and I work in the Department of Pediatric
Hematology and Oncology at the University of Münster (Germany). As a
scientist I use emboss on linux.

So here is my first question:
I have a sample of sequences in different formats. Before I try to index
them I tested them for readability by seqret:

find ./ -name "*" -exec seqret -osf fasta {} ../Sequencesothers/{} \;

Some of my files are not readable and I do not know the name of their
format:

Contig 1 (1,506)
Contig Length: 506 bases
Average Length/Sequence: 458 bases
Total Sequence Length: 1375 bases
Top Strand: 3 sequences
Bottom Strand: 0 sequences
Total: 3 sequences
^^
AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...

Maybe there is a way to convert this format appropriately.
Thanks

Jörg Mühlisch
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jmuehlis.vcf
Type: text/x-vcard
Size: 339 bytes
Desc: Card for Joerg Muehlisch
URL: 

From peter.rice at uk.lionbioscience.com Tue Oct 15 09:16:32 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 15 Oct 2002 10:16:32 +0100
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
Message-ID: <3DABDCF0.6090809@uk.lionbioscience.com>

Joerg Muehlisch wrote:
> Some of my files are not readable and I do not know the name of their
> format:
>
> Contig 1 (1,506)
> Contig Length: 506 bases
> Average Length/Sequence: 458 bases
> Total Sequence Length: 1375 bases
> Top Strand: 3 sequences
> Bottom Strand: 0 sequences
> Total: 3 sequences
> ^^
> AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...
>
> Maybe there is a way to change this format in an appropriate way.

Should be possible, if the format is common enough.

Where does the file come from? Does this program/package have an option to
save in one of the (many) 'standard' formats?

regards,

Peter Rice

--
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723

From jmuehlis at uni-muenster.de Tue Oct 15 09:44:28 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:44:28 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com>
Message-ID: <3DABE37C.B045FF6E@uni-muenster.de>

Hi,

in fact I hoped that anybody on the list would know where this format
comes from. In my file sample I just found some of these unreadable
sequences.
As it does not seem to be a well-known format, I will try to find out
where it is used.

Thanks

Jorg

Peter Rice wrote:

> Should be possible, if the format is common enough.
>
> Where does the file come from? Does this program/package have an option to
> save in one of the (many) 'standard' formats?
>
> regards,
>
> Peter Rice
>
> --
> ------------------------------------------------
> Peter Rice, LION Bioscience Ltd, Cambridge, UK
> peter.rice at uk.lionbioscience.com +44 1223 224723

From kdj at sanger.ac.uk Tue Oct 15 10:38:46 2002
From: kdj at sanger.ac.uk (Keith James)
Date: 15 Oct 2002 11:38:46 +0100
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: 

>>>>> "Joerg" == Joerg Muehlisch writes:

    Joerg> Hi, in fact I hoped that anybody on the list would know
    Joerg> where this format comes from. In my file sample I just
    Joerg> found some of these unreadable sequences. As it does not
    Joerg> seem to be a well-known format, I will try to find out
    Joerg> where it is used.

I _think_ this may be flatfile output from DNAStar/Lasergene. It's
been a while since I've seen any files like that but the ^^ delimiter
reminded me of it.

I don't have access to the package to verify this.

Keith

--

- Keith James bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -

From jrvalverde at cnb.uam.es Tue Oct 15 12:09:12 2002
From: jrvalverde at cnb.uam.es (José R. Valverde)
Date: Tue, 15 Oct 2002 14:09:12 +0200
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <20021015140912.7294cd80.jrvalverde@cnb.uam.es>

On Tue, 15 Oct 2002 11:44:28 +0200
Joerg Muehlisch wrote:

> Hi,
>
> in fact I hoped that anybody on the list would know where this format
> comes from. In my file sample I just found some of these unreadable
> sequences.
> As it does not seem to be a well-known format, I will try to find out
> where it is used.
>

Maybe it would help if you were able to post a full file sample.

From the fragments you posted it looked like a sequencing project
file. It mentioned a contig size, with many gel readings of average length
and the orientation coverage of gels (+/- strands).

If the sequence contained (you only included a few bases) is just the
consensus, i.e. a single sequence of length exactly equal to the consensus
length, then conversion should be trivial to any format. Simply do a
'tail +8 {}'

Otherwise it might contain the gel readings (and the consensus?), and
then it would be a multiple sequence file, possibly with gel overlaps et
al., and conversion may be a bit more difficult. It may also be that more
than one contig and its associated files are included in one file, making
processing more difficult.

Initially I would expect the second choice to be true, from the header:
several short sequences making up a contig plus the consensus. In your
example, the first contig would be 506 bases, composed of three gels of
average length 458. Since 1375/3 = 458, I deduce that the consensus
sequence is not included. Therefore you have a multiple sequence file of
overlapping gel readings.

You may try this:

1) find out if more than one contig is in the file
2) find out how sequences are separated
3) decide what you want to do with them, e.g.
	split the file at "^Contig " lines
	strip comment lines (^*:*$)
	split at sequence separators

see csplit(1) for details on how to do it on a pipeline.

E.g. assuming sequences are delimited by a blank line, this _might_ work:

csplit file /^Contig / -f contig
foreach i ( contig.* )
tail +8 $i | csplit - /\
\
/ -f ${i}.gel
end

(note that we need to escape newlines directly)

and you'd get the raw sequences all right as contig.##.gel.##

				j

From jmuehlis at uni-muenster.de Tue Oct 15 13:50:03 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 15:50:03 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <3DAC1D0B.3ACD64FD@uni-muenster.de>

Keith James wrote:

Yes, I think that might be; I think our collaboration group is working
with DNAStar. But nevertheless there does not seem to be an EMBOSS way to
change the file format. So I will try it with Linux tools like tr.

Thanks for your help.

Jorg

> I _think_ this may be flatfile output from DNAStar/Lasergene. It's
> been a while since I've seen any files like that but the ^^ delimiter
> reminded me of it.
>
> I don't have access to the package to verify this.
>
> Keith
>
> --
>
> - Keith James bioinformatics programming support -
> - Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -

From gbottu at ben.vub.ac.be Mon Oct 21 08:34:38 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 10:34:38 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210834.KAA1459646@black.vub.ac.be>

from : BEN

Dear colleagues,

While doing some experimenting with fuzzpro, I tried the following :

-----------------
Input sequence(s): sw:pap?_carpa
Search pattern:

from : BEN

Dear colleagues,

I was looking at what the program prophecy is doing and I am puzzled.
What is the difference between Gribskov and Henikoff profiles ? Both
seem to have match/mismatch scores computed with the help of a scoring
matrix as well as gap penalties. Furthermore, I thought that the
Henikoffs made the Blocks databank using profiles without gaps.
Can someone help me ?

Guy Bottu

From ableasby at hgmp.mrc.ac.uk Mon Oct 21 09:11:57 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 21 Oct 2002 10:11:57 +0100 (BST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210911.KAA28060@bromine.hgmp.mrc.ac.uk>

Terminating full-stops are currently not part of the EMBOSS implementation
of PROSITE patterns. Strictly they are, although unnecessary, part of the
PROSITE syntax so we can accept them for future releases.

For now if you just omit the '.' the pattern will work.

Alan

From gbottu at ben.vub.ac.be Mon Oct 21 09:36:05 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 11:36:05 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210936.LAA1462808@black.vub.ac.be>

Without the '.' it does not give an error. I get :

------------------
> fuzzpro
Protein pattern search
Input sequence(s): sw:pap?_carpa
Search pattern:

We'll look into that. Looks to be a boundary condition affecting
zero length N terminal ranges.

Thanks

Alan

From simon.andrews at bbsrc.ac.uk Mon Oct 21 14:00:35 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 15:00:35 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28753@bi-exsrv1.iapc.bbsrc.ac.uk>

I'm having all sorts of problems working with the latest release of
RefSeq, due to a change in the way the files are being laid out.

In older releases of RefSeq the LOCUS identifier was the same as the
accession number (eg NM_0123456), but in the latest version the LOCUS
identifier is the gene identifier, and these aren't unique in the
database!!

This means that when I run dbiflat (even using -idformat REFSEQ) I get a
load of warnings about duplicate entries, and when I later try to use the
database I find that a load of entries are inaccessible because of this.

For example accessions NM_134265, NM_134264 and NM_015626 all have the ID
WSB1.

How can I get dbiflat to index with the accession number as its primary
identifier so I don't lose entries when indexing them??

Thanks

Simon

PS This actually looks like a mistake by the RefSeq curators - I mean who
thought that having a non-unique primary sequence identifier was a good
idea!!!

--
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463

From simon.andrews at bbsrc.ac.uk Mon Oct 21 15:24:39 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 16:24:39 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28754@bi-exsrv1.iapc.bbsrc.ac.uk>

> -----Original Message-----
> From: simon andrews (BI) [mailto:simon.andrews at bbsrc.ac.uk]
> Subject: Indexing Refseq
>
>
> I'm having all sorts of problems working with the latest
> release of RefSeq
>
> This means that when I run dbiflat (even using -idformat
> REFSEQ) I get a load of warnings about duplicate entries and
> when I later try to use the database I find that a load of
> entries are inaccessible because of this.
>
> For example accessions NM_134265,NM_134264 and NM_015626 all
> have the ID WSB1.

Just to follow up to myself - I've found a temporary work-round for this
problem.

The Bioperl script at the bottom of the message will pre-process the
current Refseq files into a format which dbiflat can then index without
errors. You will see a warning from the NC_xxxx chromosome files in
Refseq, but as these are only features with no sequence I wasn't too
worried about them and just skipped them.

Usage of the script is "script_name [infile] > outfile".

TTFN

Simon.

-------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# This script is a filter through which we can
# pass the whole of refseq. Newer versions of
# refseq replaced their locus ID with a string
# which wasn't the accession number. This
# just changes them back.

my ($filename) = @ARGV;

die "No filename given" unless ($filename);

my $in = Bio::SeqIO -> new(-file => $filename,
                           -format => 'genbank');

die "Couldn't read $filename" unless ($in);

my $out = Bio::SeqIO -> new(-fh => \*STDOUT,
                            -format => 'genbank');

die "Couldn't make output pipe" unless ($out);

while (my $seq = $in -> next_seq()){

  # Some NC_xxx seqs are in the Refseq file
  # but don't have any sequence attached. We'll
  # skip those files...

  next if ($seq -> accession =~ /^NC/);

  $seq -> display_id($seq-> accession());
  $out -> write_seq($seq);
}

#-------------------------------------------------------

From jmuehlis at uni-muenster.de Tue Oct 22 08:06:55 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 22 Oct 2002 10:06:55 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com> <3DABE37C.B045FF6E@uni-muenster.de> <3DAC1D0B.3ACD64FD@uni-muenster.de>
Message-ID: <3DB5071F.EEB21A1@uni-muenster.de>

Hi,

Just for your information. This is the answer from my collaborators:

The sequence is a DNAStar EditSeq file. The notation indicates that this
sequence is consensus sequence from multiple reads put into a contig. If
you do not have DNAStar, try to open with a wordprocessor program and cut
and paste the sequence into whatever sequence editor you use. The sequence
uses standard nomenclature (ie. W = A or T; M = A or C; etc.....)

Thanks for your help. As this format is not readable I will now just
change the format by other means.

Jorg

From Andres.Aeschlimann at id.unibe.ch Tue Oct 22 15:23:31 2002
From: Andres.Aeschlimann at id.unibe.ch (Andres Aeschlimann)
Date: Tue, 22 Oct 2002 17:23:31 +0200 (MET DST)
Subject: Cannot connect!
Message-ID: 

Hi all

Having installed jemboss for the first time.
There's still a problem left:

After launching emboss from
http://ubecx04.unibe.ch:8080/jemboss/Jemboss.jnlp
( a trial campus emboss server )
the webstart window appears as it should, and the login window as well,
where username and password can be entered.

Later on the window says

Cannot connect!

and a window "Check Public Server Settings" with the contents of the
jemboss.properties file:

user.auth=true
jemboss.server=true
server.public=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
server.private=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
service.public=JembossAuthServer
service.private=JembossAuthServer
plplot=/products/emboss/emboss/share/EMBOSS/
embossData=/products/emboss/emboss/share/EMBOSS/data/
embossBin=/products/emboss/emboss/bin/
embossPath=/usr/bin/:/bin:/packages/clustal/:/packages/primer3/bin:
acdDirToParse=/products/emboss/emboss/share/EMBOSS/acd/
embossURL=http://www.uk.embnet.org/Software/EMBOSS/Apps/

appears.

soap-2_3_1 and jakarta-tomcat-4.1.12 are installed as described in order
to use with
ftp://ftp.hgmp.mrc.ac.uk/pub/EMBOSS/patchfiles/install-jemboss-server.sh

rpcrouter listens on https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter :

SOAP RPC Router
Sorry, I don't speak via HTTP GET- you have to use HTTP POST to talk to me.

ubecx04:/products/emboss.222 % java -version
java version "1.4.0_00"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_00-b05)
Java HotSpot(TM) Client VM (build 1.4.0_00-b05, mixed mode)

on Solaris 9.

Is there any log file where the cause would be explained?

Thanks in advance for any hint.

Res

=========================================================
Dr. Andres Aeschlimann        Andres.Aeschlimann at id.unibe.ch
University of Berne
Gesellschaftsstrasse 6
CH-3012 BERNE                 tel: +41 31 631 3845
Switzerland                   fax: +41 31 631 3865

From gbottu at ben.vub.ac.be Thu Oct 24 14:20:43 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 24 Oct 2002 16:20:43 +0200 (CEST)
Subject: questions about codon usage tables
Message-ID: <200210241420.QAA1196695@black.vub.ac.be>

from : BEN

Dear colleagues,

I just took a look at codon usage tables under EMBOSS.

- there is a list of tables in .../share/EMBOSS/data/CODONS.
  Unfortunately, they have rather cryptic names. Is there a way to find
  out for which organism they are ? And from which data source do they
  come ?

- there is a program cutgextract. I tried it :

  > cutgextract
  Extract data from CUTG
  CUTG directory [.]: /db/cutg (here is the file cutg.dat)

  But it does ... nothing. Anyone a clue ?

Sincerely,
Guy Bottu

From areagp61 at yahoo.it Fri Oct 25 09:03:36 2002
From: areagp61 at yahoo.it (Graziano P.)
Date: Fri, 25 Oct 2002 11:03:36 +0200
Subject: -filter option for water and stretcher
Message-ID: <001e01c27c05$690c7520$18105709@italy.ibm.com>

Hi All,

I need to introduce sequences by standard input. I have found the -filter
qualifier in the -help -verbose options. For example, if I use this
qualifier for "transeq" I write:

transeq -filter

then I have to insert my sequence (in fasta format for example) by pasting
or writing it. When I have finished writing or pasting the sequences, I
have to press CTRL-D to terminate the standard input. Finally the program
returns its result on standard output.

I have tried to use the -filter qualifier with "water" and "stretcher".
These two programs require two input sequences in different files. If I
write as standard input:

>HTRE_ECOLI P33129 OUTER MEMBRANE USHER PROTEIN ...
PGVYDVSVYVNDQPIINQSITFVAIEGKKNAQACITLKNLLQFHINSPDINNEKAVLLAR
DETLGNCLNLTEIIPQASVRYDVNDQRLDIDVPQAWVMKNYQNYVDPSLWENGINAAMLS
NDQRLDIDVP
>YCJV_ECOLI P77481 HYPOTHETICAL ABC TRANSPORTER ...
MAQLSLQHIQKIYDNQVHVVKDFNLEIADKEFIVFVGPSGCGKSTTLRMIAGLEEISGGD
LLIDGKRMNDVPAKARNIAMVFQNYALYPHMTVYDNMAFGLKMQKIAKEVIDERVNWAAQ
KISVAELTGAEFMLYTTVGGTS

when I press CTRL-D I get the following error message:

Error: Unable to read sequence ''

How can I tell the program that what I paste or write on standard input
are two different sequences? Is there any separator character that does
it?

Best regards
Graziano

From aralp001 at udcf.gla.ac.uk Fri Oct 25 15:04:22 2002
From: aralp001 at udcf.gla.ac.uk (Dr Adam Ralph)
Date: Fri, 25 Oct 2002 16:04:22 +0100 (BST)
Subject: multi-page graphical output
In-Reply-To: <3DA4613B.3010901@uv.es>
Message-ID: 

Dear Anyone,

I am trying to write a program which outputs a graph, similar to plotcon
or cpgplot. It would appear that, the way these programs are constructed,
the graph is plotted on one page. Thus if you have a large sequence the
graph looks a bit of a mess. Other types of graphical program (like
prettyplot) which plot lines of text are able to alter the number of
characters per line and produce multiple pages.

My question is: can someone show me or give me an example program which
splits histogram/graph plots into multiple pages? Thus on one page you
can have a graph of residues 1-1000, then a graph of 1001-2000 etc.

Thanks in advance
Adam

Dr. Adam Ralph
Institute of Virology
University of Glasgow
Church Street
Glasgow G11 5JR
Phone: 0141 330 6268
Fax: 0141 337 2236
email: a.ralph at vir.gla.ac.uk

From ggaz at cpqrr.fiocruz.br Wed Oct 9 21:19:56 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:19:56 -0300
Subject: jemboss
Message-ID: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>

I would like to use the jemboss program but I need to enroll in HGMP and I don't know how I can do this.
Could you help me?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

--
This message was scanned by MailScanner for viruses
and malicious code, and is believed to be clean.
IT Service - CPqRR/FIOCRUZ.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ggaz at cpqrr.fiocruz.br Wed Oct 9 21:15:32 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:15:32 -0300
Subject: jemboss
Message-ID: <000801c26fd9$9e235fe0$6500a8c0@cpqrr.fiocruz.br>

I would like to use jemboss (the Java interface for EMBOSS) but I need to
enroll in HGMP and I don't know how I can do this.
Could you send me the email address where I can do this?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

From tcarver at hgmp.mrc.ac.uk Mon Oct 28 18:15:35 2002
From: tcarver at hgmp.mrc.ac.uk (Dr T. Carver)
Date: Mon, 28 Oct 2002 18:15:35 +0000 (GMT)
Subject: jemboss
In-Reply-To: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>
Message-ID: 

Hi

You can register at the HGMP by filling out the form at:
http://www.hgmp.mrc.ac.uk/About/Registration/

Then send it to:
UK MRC HGMP Resource Centre
Hinxton
Cambridge
CB10 1SB
UK

You will then be sent an HGMP username and password.

Regards
Tim Carver

On Wed, 9 Oct 2002, Prof. Giovanni Gazzinelli wrote:

> I would like to use the jemboss program but I need to enroll in HGMP and I don't know how I can do this.
> Could you help me?
> Thanks,
> Solange Busek
> Centro de Pesquisas René Rachou/FIOCRUZ
>
>
>

From David.Lapointe at umassmed.edu Mon Oct 28 22:21:55 2002
From: David.Lapointe at umassmed.edu (Lapointe, David)
Date: Mon, 28 Oct 2002 17:21:55 -0500
Subject: Emboss on Solaris.
Message-ID: <13B2F22F9D5DD611B07700508BB1E88F019A2D7A@edunivexch02.umassmed.edu>

We've moved to a Netra T1 and I am having problems with the PNG libraries.
I get these runtime errors using png as output (postscript/X11 work fine).
The png.h is 1.2.4. What am I missing?

$ prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: opsin.msf
Graph type [x11]: png
libpng warning: Application was compiled with png.h from libpng-1.0.6
libpng warning: Application is running with png.c from libpng-1.2.4
gd-png: fatal libpng error: Incompatible libpng version in application and library

David Lapointe
Senior Informaticist / Information Services
Assistant Professor / Cell Biology
UMass Worcester
(508) 856-5141

From David.Bauer at SCHERING.DE Tue Oct 29 06:37:00 2002
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 29 Oct 2002 07:37:00 +0100
Subject: Antwort: Emboss on Solaris.
Message-ID: 

Hi,

I also had some problems with this on Solaris. Did you try to run
configure with "--with-pngdriver=DIR"? This helps EMBOSS to pick the right
header files.

David.

We've moved to a Netra T1 and I am having problems with the PNG libraries.
I get these runtime errors using png as output (postscript/X11 work fine).
The png.h is 1.2.4. What am I missing?

$ prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: opsin.msf
Graph type [x11]: png
libpng warning: Application was compiled with png.h from libpng-1.0.6
libpng warning: Application is running with png.c from libpng-1.2.4
gd-png: fatal libpng error: Incompatible libpng version in application and library

David Lapointe
Senior Informaticist / Information Services
Assistant Professor / Cell Biology
UMass Worcester
(508) 856-5141

From shibl at seqbio.com Wed Oct 30 16:13:08 2002
From: shibl at seqbio.com (Shibl Mourad)
Date: Wed, 30 Oct 2002 11:13:08 -0500
Subject: Emboss Expert System
Message-ID: <002c01c2802f$3fec6370$2602a8c0@SEQUENCE>

Dear EMBOSS user,

We are currently developing an expert system that will complement EMBOSS.
As there are roughly 200 tools packaged within EMBOSS alone, the task of
locating the 'right' tool, especially if you are a newcomer to the
bioinformatics field, can be overwhelming.

Our expert system, openExpert, aims to simulate the 'question and answer'
conversation one would have with a bioinformatics 'expert' - but minus
their presence and wage. Although it is currently populated with only the
EMBOSS suite, we aim to broaden the knowledge base of openExpert to
encompass all known bioinformatics tools.

We are looking for 5 EMBOSS users to review the system. The review should
not take more than 30 minutes of your time and it would be of great value
to us. If you are interested, please email shibl at seqbio.com. If you
would like to try openExpert without providing a review, please indicate
so in your email and we will provide you with free access.

Help us make openExpert a valuable expert system for bioinformatics.

Thank you,

Shibl Mourad, President
Sequence Bioinformatics

From newgene at bigfoot.com Thu Oct 31 17:43:06 2002
From: newgene at bigfoot.com (clwu)
Date: Thu, 31 Oct 2002 11:43:06 -0600
Subject: emboss in cygwin
Message-ID: <3DC16BAA.1050201@bigfoot.com>

Hi, group,

I am new to the group.
I tried to compile EMBOSS under win2K/cygwin but I failed. The EMBOSS
website at HGMP mentions that "Richard Bruskiewich and Simon Kelley at the
Sanger Centre have succeeded in compiling EMBOSS under Windows NT using
the CygWin package. The resulting executables have been tested but not
thoroughly enough for a release. Contact Richard Bruskiewich for more
information." But I cannot follow the link on this page to get help.

Does anyone have successful experience with this? Are there pre-compiled
executables for cygwin available, even for just some of the standalone
programs? That would help me a lot.

Thank you in advance.

clwu