From fornai at biomed.unipi.it  Thu Oct  3 06:27:40 2002
From: fornai at biomed.unipi.it (Claudia Fornai)
Date: Thu, 3 Oct 2002 12:27:40 +0200
Subject: pepwindawall
Message-ID: <000301c26ac7$b5b26a00$060e7283@ttvgroup>

Dear EMBOSS,
I'm Claudia Fornai, and I'm writing from Italy. I would like instructions for using pepwindowall and other programs from a suitable UNIX platform.
Best regards,
Claudia Fornai


fornai at biomed.unipi.it


From letondal at pasteur.fr  Fri Oct  4 02:05:39 2002
From: letondal at pasteur.fr (Catherine Letondal)
Date: Fri, 04 Oct 2002 08:05:39 +0200
Subject: pepwindawall 
In-Reply-To: Your message of "Thu, 03 Oct 2002 12:27:40 +0200."
             <000301c26ac7$b5b26a00$060e7283@ttvgroup> 
Message-ID: <200210040605.g9465duY106667@electre.pasteur.fr>


"Claudia Fornai" wrote:
> 
> Dear EMBOSS,
> I'm Claudia Fornai, and I'm writing from Italy. I would like instructions for
> using pepwindowall and other programs from a suitable UNIX platform.
> Best regards,
> Claudia Fornai
> 
> fornai at biomed.unipi.it
> 

Hi Claudia,

The documentation probably answers many of your questions, but if you
use the Web interface provided here:
http://bioweb.pasteur.fr/seqanal/interfaces/pepwindowall.html
the results page will display the Unix command corresponding to your
parameters.

Other EMBOSS programs are available from here:
http://bioweb.pasteur.fr/intro-uk.html
(that page offers more than just EMBOSS programs, though)

--
Catherine Letondal -- Pasteur Institute Computing Center


From squiresb at macrogenics.com  Fri Oct  4 13:48:47 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Fri, 04 Oct 2002 12:48:47 -0500
Subject: Primer prediction problems...
Message-ID: <B9C33EAF.327B%squiresb@macrogenics.com>

Hello all,

I am trying to use EMBOSS to predict PCR primers. I have tried downloading
the Catapult installers for Mac OS X as well as downloading the V2.5.1 tar
file and the primer3.0.9 tar and installing them. I get errors about a
broken pipe or that no primer3_core file was found.

Can I trouble someone to point out an install document on a website that
lists a current set of instructions on installing EMBOSS and primer3 (or
another primer prediction program)?

Thanks in advance!

Burke Squires


From gwilliam at hgmp.mrc.ac.uk  Mon Oct  7 04:17:32 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Mon, 07 Oct 2002 09:17:32 +0100
Subject: Primer prediction problems...
References: <B9C33EAF.327B%squiresb@macrogenics.com>
Message-ID: <3DA1431C.A139446B@hgmp.mrc.ac.uk>

The primer3_core program needs to be on your path before you can run
eprimer3.
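
A quick way to confirm this before running eprimer3 is to ask whether
primer3_core can be found on the current PATH. The following is only a
minimal sketch, assuming Python 3 is installed; the function name
find_program is made up for this illustration and is not part of EMBOSS
or primer3.

```python
import shutil
from typing import Optional


def find_program(name: str) -> Optional[str]:
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    # primer3_core is the executable that eprimer3 needs to call.
    path = find_program("primer3_core")
    if path is None:
        print("primer3_core is not on your PATH; add its directory to PATH")
    else:
        print("primer3_core found at " + path)
```

If the program is reported as missing, adding the directory that holds
primer3_core to PATH in your shell startup file should be enough.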

Gary

Burke Squires wrote:
> 
> Hello all,
> 
> I am trying to use EMBOSS to predict PCR primers. I have tried downloading
> the Catapult installers for Mac OS X as well as downloading the V2.5.1 tar
> file and the primer3.0.9 tar and installing them. I get errors about a
> broken pipe or that no primer3_core file was found.
> 
> Can I trouble someone to point out an install document on a website that
> lists a current set of instructions on installing EMBOSS and primer3 (or
> another primer prediction program)?
> 
> Thanks in advance!
> 
> Burke Squires

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From avc at sanger.ac.uk  Mon Oct  7 06:50:40 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Mon, 07 Oct 2002 11:50:40 +0100
Subject: fasta splitter
Message-ID: <3DA16700.90280280@sanger.ac.uk>


Is there an emboss app to split a large fasta file into a set of smaller ones?
I'm combing the docs but can't see anything - it may be staring me in the
face...

thanks

Tony

-- 
##############################################################
Email: avc at sanger.ac.uk         # Webmaster,The Sanger Centre,
Tel: 01223 497512               # Hinxton, CAMBRIDGE CB10 1SA.
Fax: 01223 494919               # http://www.sanger.ac.uk/
##############################################################


From Thomas.Laurent at uk.lionbioscience.com  Mon Oct  7 07:02:02 2002
From: Thomas.Laurent at uk.lionbioscience.com (Thomas Laurent)
Date: Mon, 07 Oct 2002 12:02:02 +0100
Subject: fasta splitter
References: <3DA16700.90280280@sanger.ac.uk>
Message-ID: <3DA169AA.1040409@uk.lionbioscience.com>

Hi Tony,
I think splitter should do the job:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html

Cheers,
Thomas

Tony Cox wrote:
> Is there an emboss app to split a large fasta file into a set of smaller ones?
> I'm combing the docs but can't see anything - it may be staring me in the
> face...
> 
> thanks
> 
> Tony
> 


From avc at sanger.ac.uk  Mon Oct  7 07:16:47 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Mon, 7 Oct 2002 12:16:47 +0100 (BST)
Subject: fasta splitter
In-Reply-To: <3DA169AA.1040409@uk.lionbioscience.com>
Message-ID: <Pine.OSF.4.44.0210071216180.879493-100000@cbi1a>

On Mon, 7 Oct 2002, Thomas Laurent wrote:

+>Hi Tony,
+>I think splitter should do the job:
+>http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html

Almost, but not quite. This converts one file into many files, each containing
one sequence. I need something that converts one file containing 1000 seqs into
10 files each containing 100 seqs.

Tony


+>
+>Cheers,
+>Thomas
+>
+>Tony Cox wrote:
+>> Is there an emboss app to split a large fasta file into a set of smaller ones?
+>> I'm combing the docs but can't see anything - it may be staring me in the
+>> face...
+>>
+>> thanks
+>>
+>> Tony
+>>
+>
+>

******************************************************
Tony Cox			Email:avc at sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************


From jweiner1 at ix.urz.uni-heidelberg.de  Mon Oct  7 07:47:33 2002
From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3)
Date: Mon, 7 Oct 2002 13:47:33 +0200 (METDST)
Subject: fasta splitter
In-Reply-To: <Pine.OSF.4.44.0210071216180.879493-100000@cbi1a>
Message-ID: <Pine.A41.4.42.0210071340490.34202-101000@aixterm1.urz.uni-heidelberg.de>

Hello,

> Almost, but not quite. This converts one file into many files, each containing
> one sequence. I need something that converts one file containing 1000 seqs
> into 10 files each containing 100 seqs.

I wrote you a simple perl script which should do the job.  Save it to a
file and make it executable (I think you are using a Unix-based system,
aren't you?) with chmod a+x split.pl.  To be on the safe side, put it in a
new directory, and copy your sequence file to the same directory.  Now run

./split.pl <filename> <number of sequences>

...where <filename> is the name of the file containing your 1000+ sequences,
and <number of sequences> is the number of sequences you wish to have in
each output file.  The output files will have the same name as the
original file with the suffix .1, .2, .3, etc.

I tried the script and it seems to work fine.  Meet the power of Perl :-)
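
For readers who cannot get the attachment, the behaviour described above
can be sketched roughly as follows. This is not the attached split.pl, only
a Python 3 illustration of the same idea; the function name split_fasta is
an assumption, and the sketch treats every line starting with ">" as the
beginning of a new sequence.

```python
import sys
from typing import List


def split_fasta(filename: str, per_file: int) -> List[str]:
    """Copy the records of `filename` into filename.1, filename.2, ...

    Each output file receives at most `per_file` records.
    Returns the list of file names that were written.
    """
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    written = []
    out = None
    count = 0  # number of records seen so far
    with open(filename) as src:
        for line in src:
            if line.startswith(">"):
                # Start a new output file every `per_file` records.
                if count % per_file == 0:
                    if out is not None:
                        out.close()
                    name = "{}.{}".format(filename, len(written) + 1)
                    out = open(name, "w")
                    written.append(name)
                count += 1
            if out is not None:  # ignore anything before the first header
                out.write(line)
    if out is not None:
        out.close()
    return written


if __name__ == "__main__" and len(sys.argv) == 3:
    for name in split_fasta(sys.argv[1], int(sys.argv[2])):
        print(name)
```

Saved as, say, split.py, it would be run as
"python3 split.py <filename> <number of sequences>".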

Regards,
j.

----)-\//-///-----------------------------------January-Weiner-3-------
Technologists often forget the general user. Technology is only as good as
the user experience. That is something that technology groups very often
forget... [ Linus Torvalds, taken from the GNOME Usability Project ]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: split.pl
Type: application/x-perl
Size: 849 bytes
Desc: 
Url : http://lists.open-bio.org/pipermail/emboss/attachments/20021007/304bd4fb/attachment.bin 

From areagp61 at yahoo.it  Mon Oct  7 08:49:35 2002
From: areagp61 at yahoo.it (Graziano P.)
Date: Mon, 7 Oct 2002 14:49:35 +0200
Subject: Codon usage files
Message-ID: <000b01c26e00$03ee27f0$18105709@italy.ibm.com>

Hi all,
with backtranseq I can use different codon usage tables by selecting different
"codon usage files" in the EMBOSS data path. Some file names are
self-explanatory (for example, Ehuman.cut is the codon usage file for
Homo sapiens), but others are not, such as Eacc.cut, Esma.cut, Eddi.cut, etc.
Is there any document that describes every file?

Thanks
Graziano Pappad?




From md0nilhe at mdstud.chalmers.se  Mon Oct  7 09:10:33 2002
From: md0nilhe at mdstud.chalmers.se (Henrik Nilsson)
Date: Mon, 7 Oct 2002 15:10:33 +0200 (MET DST)
Subject: EMBASSY problem
Message-ID: <Pine.SOL.4.30.0210071508080.21060-100000@grosse.mdstud.chalmers.se>


Hello

I'm having major problems with compiling the PHYLIP package of EMBASSY.
Would anyone happen to have compiled it successfully on RedHat 7.3, and
would be willing to send me the executables?

hENRiK

--

          Written using

        VIM - Vi IMproved

           version 5.0

        http://www.vim.org


From ableasby at hgmp.mrc.ac.uk  Mon Oct  7 09:14:43 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 7 Oct 2002 14:14:43 +0100 (BST)
Subject: Codon usage files
Message-ID: <200210071314.OAA29103@bromine.hgmp.mrc.ac.uk>

Not every file but most are described in the README file
from ftp://ftp.ebi.ac.uk/pub/databases/codonusage

You can use the EMBOSS program 'cutgextract' on the CUTG
database to get files with more meaningful (long) names.


Alan


From mathog at mendel.bio.caltech.edu  Mon Oct  7 10:49:05 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Mon, 07 Oct 2002 07:49:05 -0700
Subject: fasta splitter
Message-ID: <E17yZBx-0003ii-00@mendel.bio.caltech.edu>


> 
> Is there an emboss app to split a large fasta file into a set of
> smaller ones?
> I'm combing the docs but can't see anything - it may be staring me in the
> face...

This isn't an EMBOSS entry, but it will probably do what you want:

  ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

There are some other fasta related utilities in the same directory.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From tmargus at ebc.ee  Mon Oct  7 13:54:12 2002
From: tmargus at ebc.ee (Tõnu Margus)
Date: Mon, 7 Oct 2002 20:54:12 +0300
Subject: WWW  - Emma is not able to create SOME   temporary files
Message-ID: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>

Hi,

I am using EMBOSS via Luke McCarthy's web interface. All other programs are working,
but emma does not work correctly.

It gives an error:

Error: failed to open filename 8808B Problem writing out EMBOSS alignment fileError: failed to open filename 8808B Problem writing out EMBOSS alignment file

It seems that for some reason it cannot create a file under the runs/temp directory.
Why not is unclear to me.  All the other files are there.

Files in the directory runs/fileVxWbES$/:

root at kobra:fileVxWbES$ ls -l
total 16
-rw-r--r--   1 www      java          915 Oct  7 20:51 8825A
-rw-r--r--   1 www      java            0 Oct  7 20:51 dendoutfile
-rw-r--r--   1 www      java          384 Oct  7 20:51 error
-rw-r--r--   1 www      java         2145 Oct  7 20:51 index.html
drwxr-xr-x   2 www      java         4096 Oct  7 20:51 input
-rw-r--r--   1 www      java            0 Oct  7 20:51 outseq

clustalw works fine from the command line.
 
Is there a solution for this problem?


Tõnu Margus


From starksb at ebi.ac.uk  Mon Oct  7 14:58:15 2002
From: starksb at ebi.ac.uk (David Starks-Browning)
Date: Mon,  7 Oct 2002 19:58:15 +0100
Subject: WWW  - Emma is not able to create SOME   temporary files
In-Reply-To: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
References: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
Message-ID: <5473-Mon07Oct2002195815+0100-starksb@ebi.ac.uk>

On Monday 7 Oct 02, Tõnu Margus writes:
> Hi,
> 
> I am using EMBOSS via Luke McCarthy's web interface. All other programs are working,
> but emma does not work correctly.
> 
> It gives an error:
> 
> Error: failed to open filename 8808B Problem writing out EMBOSS alignment fileError: failed to open filename 8808B Problem writing out EMBOSS alignment file
> 
> It seems that for some reason it cannot create a file under the runs/temp directory.
> Why not is unclear to me.  All the other files are there.
>
> Files in the directory runs/fileVxWbES$/:
> 
> root at kobra:fileVxWbES$ ls -l
> total 16
> -rw-r--r--   1 www      java          915 Oct  7 20:51 8825A
> -rw-r--r--   1 www      java            0 Oct  7 20:51 dendoutfile
> -rw-r--r--   1 www      java          384 Oct  7 20:51 error
> -rw-r--r--   1 www      java         2145 Oct  7 20:51 index.html
> drwxr-xr-x   2 www      java         4096 Oct  7 20:51 input
> -rw-r--r--   1 www      java            0 Oct  7 20:51 outseq
> 
> clustalw works fine from the command line.

You don't show the permissions of the directory itself (use 'ls -la').
It's the directory permissions that determine whether files can be
created.
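
If it is easier than reading the 'ls -la' output, the same check can be
scripted. This is a minimal sketch, assuming Python 3 and that it is run
as the same user as the web server (www in the listing above); the function
name can_create_files is made up for this example.

```python
import os


def can_create_files(directory: str) -> bool:
    """Return True if the current user may create files in `directory`.

    Creating a file needs write and search (execute) permission on the
    directory itself, whatever the permissions of the files inside it.
    """
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Check the current working directory as a simple demonstration.
    print(can_create_files(os.getcwd()))
```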

However, this may not be the problem.  We have seen problems with emma
on Linux, because the underlying application, clustalw, cannot deal
with filenames that are 5 characters long on Linux.  String buffer
management bugs in emma cause it to emit garbage characters after the
filename to the open() system call.  With emma, you will see this when
emma's PID is 4 digits long.  (You won't see the garbage characters in
error messages.  You only see them under strace.)

Clustalw should be fixed.  If that won't happen, emma.c could be
modified to pad the temporary file name with enough extra characters
so that, regardless of Linux PID, emma will use temp filenames longer
than 5 characters.
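
The padding idea itself is simple. The sketch below is Python rather than
C and is not a patch for emma.c; it only illustrates the naming scheme,
assuming the temp name is the PID followed by a one-letter suffix, as the
names 8808B and 8825A in the report suggest. The function name and the
minimum length of 8 are arbitrary choices made for the illustration.

```python
def padded_temp_name(pid: int, suffix: str, min_len: int = 8) -> str:
    """Build a temp file name from a PID and a suffix.

    The PID is zero-padded so that the whole name has at least
    `min_len` characters, however short the PID is.
    """
    width = max(min_len - len(suffix), 1)
    return "{:0{w}d}{}".format(pid, suffix, w=width)


if __name__ == "__main__":
    print(padded_temp_name(8808, "B"))
```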

I don't have a patch for the latest version of emma, because I applied
the workaround to an old (1.9.1) version of EMBOSS.  Emma.c has changed a bit
since then, although the change is still straightforward to apply.

If you think this is your problem, I can provide details on how to
modify emma.c.

Hope this helps.

Kind regards,
David

 -------------------------------------------------------------------
  David Starks-Browning                  | starksb at ebi.ac.uk
  EMBL Outstation --                     |
  The European Bioinformatics Institute  |
  Wellcome Trust Genome Campus           | tel: +44 (1223) 494 616
  Hinxton, Cambridge, CB10 1SD, UK       | fax: +44 (1223) 494 468
 -------------------------------------------------------------------


From tcarver at hgmp.mrc.ac.uk  Tue Oct  8 04:44:34 2002
From: tcarver at hgmp.mrc.ac.uk (Tim Carver)
Date: Tue, 08 Oct 2002 09:44:34 +0100
Subject: Jemboss Server Feedback
Message-ID: <3DA29AF2.BE48E5F5@hgmp.mrc.ac.uk>


It would be immensely useful if those who have set up a Jemboss server could
provide some feedback to us. This is useful in providing some ideas for the
future direction of its development and to give our funding body some idea
of its usage at other sites. In particular the following information would
be of use:


1. Nationality
2. Funding body and/or Organisation
3. Server Platform  O/S (linux, solaris, MacOSX, AIX, HP-UX....)
4. Type of installation - e.g. with unix authorisation
5. Number of users at your site using Jemboss
6. Comments
     - what were you using before & why you changed
     - likes, dislikes & suggestions for Jemboss development (server &
       client)


Many thanks in advance,
Tim Carver

HGMP-RC


From mq1 at sanger.ac.uk  Tue Oct  8 09:00:06 2002
From: mq1 at sanger.ac.uk (Mike Quail)
Date: Tue, 8 Oct 2002 14:00:06 +0100
Subject: restriction mapping
Message-ID: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk>

Hi

I am currently looking to isolate restriction fragments that cover gaps that are left in several genomes. To do this I need to cut the sequence we have of a genome with all known database enzymes and then select those that just cut a few times and in the right place so as to excise the region of the genome I require. 

GCG programs map and mapplot were excellent for doing this. Map in particular is good as it gives a graphical plot for each enzyme (one enzyme per line) plotting all the enzymes on a page or two so you can rapidly see which is appropriate.

I have tried the EMBOSS programs and basically they are no use. REMAP does what I want but in too great detail (the output would stretch round the globe) and RESTRICT is too unordered in its output. 

I have got a program called oligo on my PC that will do this, BUT it has problems with big sequences. Recently I tried analysing a 1.5Mb chromosome and it would only work if I limited the number of enzymes to 6 or less. So I could transfer the data over to my PC and try with that but as this organism is 5Mb it will be very slow going.

Have you any ideas of how this could be done in EMBOSS?

M.Quail


Project Leader

Wellcome Trust Sanger Institute


From peter.rice at uk.lionbioscience.com  Tue Oct  8 09:30:31 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 14:30:31 +0100
Subject: restriction mapping
References: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk>
Message-ID: <3DA2DDF7.70604@uk.lionbioscience.com>

Mike Quail wrote:
> I am currently looking to isolate restriction fragments that cover gaps 
> that are left in several genomes. To do this I need to cut the sequence 
> we have of a genome with all known database enzymes and then select 
> those that just cut a few times and in the right place so as to excise 
> the region of the genome I require.
> 
> Have you any ideas of how this could be done in EMBOSS.

You just need to know the enzymes that only cut twice, for example?

% restrict -min 2 -max 2 -plasmid

(the -plasmid may look odd, but it means "circular DNA" and says nothing 
about the size :-)

You can also check each enzyme one at a time afterwards:

% restrict -plasmid -fragment -enzyme BssHI

... the -fragment option includes the fragment sizes at the end of the 
report. You will need the positions and the fragment sizes to choose an enzyme.

You can select other report formats (-rformat), but the default is probably 
the most useful for your case (-rformat EMBL or GFF, for example, will omit 
the -fragment output).
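
To make the selection step concrete, here is a minimal sketch of the
arithmetic involved, assuming the cut positions have already been collected
per enzyme. It does not parse restrict output; the function names, the
placeholder enzyme names and the positions are all invented for the
illustration.

```python
from typing import Dict, List


def enzymes_cutting_n_times(sites: Dict[str, List[int]], n: int = 2) -> List[str]:
    """Return the names of the enzymes with exactly `n` cut positions."""
    return sorted(name for name, pos in sites.items() if len(pos) == n)


def circular_fragments(cut_positions: List[int], seq_length: int) -> List[int]:
    """Fragment sizes obtained by cutting a circular molecule."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [seq_length]  # uncut circle
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    # The last fragment wraps around the origin of the circle.
    sizes.append(seq_length - cuts[-1] + cuts[0])
    return sizes


if __name__ == "__main__":
    # Placeholder data, not real restriction sites.
    sites = {"EnzA": [100, 400], "EnzB": [50], "EnzC": [10, 20, 30]}
    for name in enzymes_cutting_n_times(sites, 2):
        print(name, circular_fragments(sites[name], 1000))
```

With two cuts on a circle there are exactly two fragments, so the positions
tell you which of the two contains the region you want to excise.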

Meanwhile, a graphical view could be nice so you can look for restriction 
sites on screen.
We can look into that.

Hope this helps,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From jonas.andersson at rocketmail.com  Tue Oct  8 10:35:50 2002
From: jonas.andersson at rocketmail.com (Jonas Andersson)
Date: Tue, 8 Oct 2002 07:35:50 -0700 (PDT)
Subject: Not compiling?
Message-ID: <20021008143550.40367.qmail@web40110.mail.yahoo.com>

When I try to compile the latest EMBOSS this is what I get. What am I
doing wrong, given that I followed the instructions on the EMBOSS pages?


-MT ajreport.lo -MD -MP -MF .deps/ajreport.TPlo -o ajreport.o
>/dev/null 2>&1
make[1]: *** [ajreport.lo] Error 1
make[1]: Leaving directory `/home/henrik/temp/emboss/EMBOSS-2.5.1/ajax'
make: *** [all-recursive] Error 1

/ Jonas



From avc at sanger.ac.uk  Tue Oct  8 11:10:22 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Tue, 8 Oct 2002 16:10:22 +0100 (BST)
Subject: fasta splitter 
In-Reply-To: <Pine.A41.4.42.0210081639270.31834-100000@aixterm1.urz.uni-heidelberg.de>
Message-ID: <Pine.OSF.4.44.0210081607100.919344-100000@cbi1a>

On Tue, 8 Oct 2002, January Weiner 3 wrote:

Thanks to all who responded. I did, in the end, write a 12-line bioperl script
to split my fasta file. My request seems, however, to highlight a small blind
spot on the EMBOSS radar. It appears that there are a number of implementations
out there - perhaps one of them can be donated to the emboss project as the
basis of a new software tool?

Tony

+>Hi,
+>
+>> This is apparently something that is frequently asked by biologists.
+>> If you call it fastasplitter, I have a Web interface ready for it:
+>> http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html
+>> If you think it's interesting, I will install it, and in that case I will
+>> put your name (J. Weiner ?) on the Web interface.
+>
+>No problem, do it, it's freeware (not even GPL :-).  However, if you think
+>that such a tool is useful, then I'll rewrite it in C -- to make it faster.
+>If I may suggest -- it'd be nice if you could download or get the produced
+>files as a tgz or zip archive.
+>
+>j.
+>
+>----)-\//-///-----------------------------------January-Weiner-3-------
+>Wyszła Hożą i Czystą, wróciła Wspólną i Niecałą [ (C) by moja babcia ]
+>
+>

******************************************************
Tony Cox			Email:avc at sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************


From Joerg.Schaber at uv.es  Tue Oct  8 11:58:11 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Tue, 08 Oct 2002 17:58:11 +0200
Subject: loading DDBJ data into EMBOSS
Message-ID: <3DA30093.6080404@uv.es>

Hi,

I have problems creating an EMBOSS database from a DDBJ flatfile (e.g. 
ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using 
'dbiflat -idformat gb'. I get a warning for all entries in the flatfile:
"Warning: Duplicate ID skipped: '<null>' All hits will point to first ID 
found", and I cannot retrieve any sequence. I think dbiflat only 
recognizes the first entry.
When I download the corresponding fasta flatfile I have no problems 
creating an EMBOSS database using 'dbifasta'. However, I would like to 
use the original DDBJ flatfile because it includes more information.
Any idea what's the problem?

greetings,

joerg


From peter.rice at uk.lionbioscience.com  Tue Oct  8 12:08:47 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:08:47 +0100
Subject: loading DDBJ data into EMBOSS
References: <3DA30093.6080404@uv.es>
Message-ID: <3DA3030F.2030808@uk.lionbioscience.com>

Joerg Schaber wrote:
> Hi,
> 
> I have problems creating an EMBOSS database from a DDBJ flatfile (e.g. 
> ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using 
> 'dbiflat -idformat gb'. I get a warning for all entries in the flatfile:
> "Warning: Duplicate ID skipped: '<null>' All hits will point to first ID 
> found", and I cannot retrieve any sequence. I think dbiflat only 
> recognizes the first entry.
> When I download the corresponding fasta flatfile I have no problems 
> creating an EMBOSS database using 'dbifasta'. However, I would like to 
> use the original DDBJ flatfile because it includes more information.
> Any idea what's the problem?

Yes ... that file is not in Genbank or DDBJ format!!!!

It looks more like a CODATA format, but only the ENTRY is recognized.
If you can find a name for it, we could probably implement a new 
input/output sequence format ... but it has some horrible features that 
will not be general.

Example entry:

ENTRY       BU002             CDS       Buchnera
NAME        atpB
DEFINITION  ATP synthase A chain [EC:3.6.3.14] [SP:ATP6_BUCAI]
CLASS       Metabolism; Energy Metabolism; Oxidative phosphorylation
             [PATH:buc00190]
             Metabolism; Energy Metabolism; ATP synthesis [PATH:buc00193]
             Metabolism; Energy Metabolism; Photosynthesis [PATH:buc00195]
POSITION    2278..3102
DBLINKS     RIKEN: BU002
             NCBI: 10038695
CODON_USAGE       T               C               A               G
           T  27   2  22   7  11   0   7   1   7   1   1   0   1   0   0   5
           C   4   0   3   2   6   1   4   2   5   1   8   2   1   0   2   0
           A  28   0   5  12   5   0   3   0   7   3  13   1   4   1   0   0
           G   4   1  12   3   5   1   5   0   8   0   7   1   7   2   4   0
AASEQ       274
             MILEKISDPQKYISHHLSHLQIDLRSFKIIQPGALSSDYWTVNVDSMFFSLVLGSFFLSI
             FYMVGKKITQGIPGKLQTAIELIFEFVNLNVKSMYQGKNALIAPLSLTVFIWVFLMNLMD
             LVPIDFFPFISEKVFELPAMRIVPSADINITLSMSLGVFFLILFYTVKIKGYVGFLKELI
             LQPFNHPVFSIFNFILEFVSLVSKPISLGLRLFGNMYAGEMIFILIAGLLPWWTQCFLNV
             PWAIFHILIISLQAFIFMVLTIVYLSMASQSHKD
NTSEQ       825
             atgattttagaaaagatatctgatcctcaaaaatatattagtcatcatttaagtcacttg
             cagatagatttgcgttcttttaaaattattcaaccaggtgcattgtcttctgattattgg
             actgtaaatgttgattcaatgtttttttctcttgtactgggtagtttttttttaagtatt
             ttttatatggtaggaaaaaaaattactcaaggtataccaggtaaattacaaactgcaatt
             gagttaatttttgaatttgtaaatttaaatgtaaaaagcatgtatcaaggtaaaaatgct
             cttattgcacctttatcattaacagtatttatttgggtttttttaatgaatctaatggat
             ttagttccgattgatttctttccatttatttctgaaaaagtgtttgaattacctgctatg
             cgaattgtaccttctgctgatattaatattacactatcaatgtcacttggcgtgtttttt
             ttaattttattttatactgttaaaattaaaggatatgtaggctttttaaaagaacttatt
             ttacaacctttcaaccatcctgtattttctatttttaattttatattagaatttgtgtca
             ttggtctcgaaacccatttctttgggattgcgattatttggaaacatgtacgcaggtgaa
             atgatttttattttaattgcaggtttgctgccatggtggacacaatgttttttaaacgta
             ccgtgggctatttttcatattttaataatttcactacaggcttttatttttatggtatta
             actattgtatatttatcaatggcctctcaatctcataaagattaa
///
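
Until such a format exists, one workaround is to convert the entries to
FASTA yourself and index the result with dbifasta. The sketch below is only
an illustration in Python 3, not an EMBOSS feature: it assumes each field
name starts in the first column, continuation lines are indented, and an
entry ends with "///", as in the example above. The function name is
hypothetical, and everything except the ENTRY identifier and the NTSEQ
sequence is discarded, so the extra annotation is still lost.

```python
def kegg_entry_to_fasta(text: str) -> str:
    """Turn one flatfile entry into a FASTA record (ENTRY id + NTSEQ)."""
    entry_id = None
    seq_lines = []
    in_ntseq = False
    for line in text.splitlines():
        if line.startswith("///"):
            break  # end of entry
        if line and not line[0].isspace():
            # A new field starts in the first column.
            fields = line.split()
            in_ntseq = fields[0] == "NTSEQ"
            if fields[0] == "ENTRY" and len(fields) > 1:
                entry_id = fields[1]
            continue
        if in_ntseq:
            seq_lines.append(line.strip())
    if entry_id is None:
        raise ValueError("no ENTRY line found")
    return ">{}\n{}\n".format(entry_id, "".join(seq_lines))


if __name__ == "__main__":
    # A tiny made-up entry, only to show the expected layout.
    sample = (
        "ENTRY       X001              CDS       Test\n"
        "NAME        foo\n"
        "AASEQ       2\n"
        "            MK\n"
        "NTSEQ       9\n"
        "            atgaaa\n"
        "            taa\n"
        "///\n"
    )
    print(kegg_entry_to_fasta(sample), end="")
```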


-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From peter.rice at uk.lionbioscience.com  Tue Oct  8 12:37:36 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:37:36 +0100
Subject: fasta splitter
References: <Pine.OSF.4.44.0210081607100.919344-100000@cbi1a>
Message-ID: <3DA309D0.7@uk.lionbioscience.com>

Tony Cox wrote:
> On Tue, 8 Oct 2002, January Weiner 3 wrote:
> 
> Thanks to all that responded. I did, in the end write a 12 line bioperl script
> to split my fasta file. My request seems, however, to highlight a small blind
> spot on the EMBOSS radar. It appears that there are a number of implementations
> out there - perhaps one of them can be donated to the emboss project as the
> basis of a new software tool?

Nobody suggested hacking "seqret" to do what you want...

One problem doing this in EMBOSS is the need to generate filenames for your 
  split files - but maybe a base filename would be enough to generate 
names. Then all you need to do is count sequences in a modified seqret.c 
and change the output file. You can add a command line option for the 
number of sequences in an output file. Cleaning up output files for a rerun 
is an exercise for the user (unless you want to invent a new ACD type that 
does it :-)

Needs a modified version of the seqFileReopen function to handle the file 
naming, but nothing complicated is involved.

regards.

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From avc at sanger.ac.uk  Tue Oct  8 12:39:31 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Tue, 8 Oct 2002 17:39:31 +0100 (BST)
Subject: fasta splitter
In-Reply-To: <3DA309D0.7@uk.lionbioscience.com>
Message-ID: <Pine.OSF.4.44.0210081738450.919344-100000@cbi1a>

On Tue, 8 Oct 2002, Peter Rice wrote:

that sounds excellent - does this mean it really will make it in to the EMBOSS
release? (any idea when? ;)

Tony


+>Tony Cox wrote:
+>> On Tue, 8 Oct 2002, January Weiner 3 wrote:
+>>
+>> Thanks to all that responded. I did, in the end write a 12 line bioperl script
+>> to split my fasta file. My request seems, however, to highlight a small blind
+>> spot on the EMBOSS radar. It appears that there are a number of implementations
+>> out there - perhaps one of them can be donated to the emboss project as the
+>> basis of a new software tool?
+>
+>Nobody suggested hacking "seqret" to do what you want...
+>
+>One problem doing this in EMBOSS is the need to generate filenames for your
+>  split files - but maybe a base filename would be enough to generate
+>names. Then all you need to do is count sequences in a modified seqret.c
+>and change the output file. You can add a command line option for the
+>number of sequences in an output file. Cleaning up output files for a rerun
+>is an exercise for the user (unless you want to invent a new ACD type that
+>does it :-)
+>
+>Needs a modified version of the seqFileReopen function to handle the file
+>naming, but nothing complicated is involved.
+>
+>regards.
+>
+>Peter
+>
+>--
+>------------------------------------------------
+>Peter Rice, LION Bioscience Ltd, Cambridge, UK
+>peter.rice at uk.lionbioscience.com +44 1223 224723
+>

******************************************************
Tony Cox			Email:avc at sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************


From peter.rice at uk.lionbioscience.com  Tue Oct  8 12:42:39 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:42:39 +0100
Subject: fasta splitter
References: <Pine.OSF.4.44.0210081738450.919344-100000@cbi1a>
Message-ID: <3DA30AFF.3010100@uk.lionbioscience.com>

Hi Tony

> that sounds excellent - does this mean it really will make it in to the EMBOSS
> release? (any idea when? ;)

I already have the first part of the code ... a modified "seqret" to split 
into 10 sequences per file.

Working copy is called "tenco" :-)

What did you have in mind as a naming convention for the output files? The 
existing code names each file after the first sequence, I guess you want 
"outfile.1" "outfile.2" and so on, possibly with leading zeroes 
"outfile.001" etc.

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From letondal at pasteur.fr  Tue Oct  8 14:11:11 2002
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 08 Oct 2002 20:11:11 +0200
Subject: fasta splitter 
In-Reply-To: Your message of "Mon, 07 Oct 2002 13:47:33 +0200."
             <Pine.A41.4.42.0210071340490.34202-101000@aixterm1.urz.uni-heidelberg.de> 
Message-ID: <200210081811.g98IBBuY253618@electre.pasteur.fr>


January Weiner 3 wrote:
> 
> Hello,
> 
> > almost, but not quite. This converts one file to many files containg one
> > sequence. I need something like a conversion of one file containing 1000
> > seqs to 10 files each  containing 100 seqs
> 
> I wrote you a simple perl script which should do the job.  Save it to a
> file and make it executable (I think you are using a Unix-based system,
> aren't you?) with chmod a+x split.pl.  To be on the safe side, put it in a
> new directory, and copy your sequence file to the same directory.  Now run
> 
> ./split.pl <filename> <number of sequences>
> 
> ...where filename is the name of the file containing your 1000+ sequences,
> and <number of sequences> is the number of sequences you wish to have in
> each produced file.  The produced file will have the same name as the
> original file with the appendix .1, .2, .3 etc.
> 
> I tried the script and it seems to work fine.  Meet the power of Perl :-)

For information, I have installed this script and the program is
available at: http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html

--
Catherine Letondal -- Pasteur Institute Computing Center


From avc at sanger.ac.uk  Tue Oct  8 14:38:50 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Tue, 8 Oct 2002 19:38:50 +0100
Subject: fasta splitter
References: <Pine.OSF.4.44.0210081738450.919344-100000@cbi1a> <3DA30AFF.3010100@uk.lionbioscience.com>
Message-ID: <000d01c26ef9$f206f710$0a00a8c0@zeus>


----- Original Message -----
From: "Peter Rice" <peter.rice at uk.lionbioscience.com>
To: "Tony Cox" <avc at sanger.ac.uk>
Cc: "January Weiner 3" <jweiner1 at ix.urz.uni-heidelberg.de>;
<pise at pasteur.fr>; <emboss at embnet.org>
Sent: Tuesday, October 08, 2002 5:42 PM
Subject: Re: fasta splitter


> Hi Tony
>
> > that sounds excellent - does this mean it really will make it in to the
> > EMBOSS release? (any idea when? ;)
>
> I already have the first part of the code ... a modified "seqret" to split
> into 10 sequences per file.
>
> Working copy is called "tenco" :-)
>
> What did you have in mind as a naming convention for the output files? The
> existing code names each file after the first sequence, I guess you want
> "outfile.1" "outfile.2" and so on, possibly with leading zeroes
> "outfile.001" etc.

Hi Peter,

This sounds great to me. Personally, I'd prefer not to have the leading
zeros - just an incrementing ".[integer]" appended to the filename supplied.
Makes shell manipulation easier.

I guess the ideal would be to be able to supply either a number of chunks to
split the file into or else a maximum size (either in bytes or FASTA
entries) for each chunk.

cheers

Tony


From mathog at mendel.bio.caltech.edu  Tue Oct  8 15:00:12 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 08 Oct 2002 12:00:12 -0700
Subject: fasta splitter
Message-ID: <E17yzaW-0004O6-00@mendel.bio.caltech.edu>

There's more than one way to split a fasta file...

1.  Split M entries into N files, file 1 receives 1->M/N,
file 2 receives M/N+1->2M/N, etc. Advantages - only one
file needs to be open at a time, simple.  Disadvantage -
the resulting split is typically uneven.  Do this with the
NCBI databases and you'll find that they are heavily weighted
for smaller sequences at the beginning and longer ones at the
end.  If the point of the split is to load balance (this is
what I use it for, with parallel BLAST) some nodes will finish
much earlier than others. Implementation: (deleted, I found
this method not to be generally useful)

1b.  head/tail/segment entries out of a fasta file.  While (1)
caused a lot of problems, I've often needed to chop out a specific
part of a fasta file.  Why?  Because some piece of software was
blowing up on the 351,234th entry, but only if preceded by several
thousand other entries. Finding the smallest piece that will trigger the
bug can save hours of run time when debugging these sorts of problems.
Implementation:

   ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c

2.  Split M entries into N files, cycling output to each file.
That is, entry i goes to file i modulo N.  Advantage - resulting
files tend to be more even in size.  Disadvantage - N output files
must be open at once (or you have to cycle through N times, once
per phase); if M is small and the size of each entry large, the
resulting files will not generally be balanced.  Example: splitting
the yeast genome. Heaven help us when full-length human chromosomes
start showing up as single FASTA file entries. Implementation:

  ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

3.  Split P bases in M entries into N files "evenly", fragmenting
sequences if they are too large.  Advantage:  fixes the genome
data problem from (2). Disadvantages:  even more complex than
(2) and "entries" in resulting files do not correspond one to
one with the original. Even with clever naming conventions 
(yeastII_100001_200000) end users will be confused.  Clever
names will be truncated by most software at the worst possible
place resulting in a "hit" on "yeastII_" :-(.  Implementation
(well, partially; this one translates in all 6 frames, but
it has some of the naming/fragmenting features):

   ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c

4.  Split by content, i.e., strip all the human sequences out
of nr.  I don't believe there is a general solution because there
is no universally agreed upon FASTA header line format.
Implementation:  SRS or something similar.
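
Strategies 1 and 2 above can be sketched in a few lines of Python. This is only an illustration of the two assignment rules, not the code in fastasplitn.c; the function names `read_fasta`, `split_contiguous` and `split_cycle` are hypothetical, and everything is held in memory for brevity.

```python
from typing import Iterable, Iterator, List


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA entry (header plus sequence lines) as one string."""
    entry: List[str] = []
    for line in lines:
        if line.startswith(">") and entry:
            yield "".join(entry)
            entry = []
        entry.append(line)
    if entry:
        yield "".join(entry)


def split_contiguous(entries: List[str], n: int) -> List[List[str]]:
    """Strategy 1: each output receives a contiguous block of about M/N entries."""
    size = -(-len(entries) // n)  # ceiling division
    return [entries[k * size:(k + 1) * size] for k in range(n)]


def split_cycle(entries: List[str], n: int) -> List[List[str]]:
    """Strategy 2: entry i goes to output i modulo N."""
    return [entries[k::n] for k in range(n)]
```

With sequences sorted short to long, `split_contiguous` puts all the long ones in the last output, while `split_cycle` spreads them across all N outputs.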


Regards,


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From squiresb at macrogenics.com  Tue Oct  8 15:20:35 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Tue, 08 Oct 2002 14:20:35 -0500
Subject: eprimer3...broken pipe?
In-Reply-To: <E17yzaW-0004O6-00@mendel.bio.caltech.edu>
Message-ID: <B9C89A33.33C4%squiresb@macrogenics.com>

I have tried to install various versions of EMBOSS, and when I try to run
eprimer3 I get the following message:

[loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
Picks PCR primers and hybridization oligos
Input sequence(s): /bioinfo/fragments.fa
Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out

   EMBOSS An error in eprimer3.c at line 317:
The program 'primer3_core' must be on the path.
It is part of the 'primer3' package,
available from the Whitehead Institute.
See: http://www-genome.wi.mit.edu/
Broken pipe


Does anybody know how to fix this?

Thanks!

Burke Squires

-- 
Burke Squires
Bioinformatics
MacroGenics, Inc.
2600 Stemmons Freeway, Suite 210
Dallas, TX 75235 USA
Work: 214-634-3000 X224
Squiresb @ macrogenics.com (Please remove spaces to use)
www.macrogenics.com
----------------------------------------------------------------------------
This e-mail and any attachments may be confidential or legally privileged.
If you received this message in error or are not the intended recipient, you
should destroy the e-mail message and any attachments or copies, and you are
prohibited from retaining, distributing, disclosing or using any information
contained herein.  Please inform us of the erroneous delivery by return
e-mail. 

Thank you for your cooperation. 


From tchiang at bioinfo.sickkids.on.ca  Tue Oct  8 15:31:18 2002
From: tchiang at bioinfo.sickkids.on.ca (Ted Chiang)
Date: Tue, 8 Oct 2002 15:31:18 -0400 (EDT)
Subject: cusp
Message-ID: <Pine.GSO.4.05.10210081522430.8600-100000@kenny>


Hi,

I have a question about the EMBOSS program cusp.  The program creates a
codon usage table based on the "coding" sequence of the input file.  My
question is: how does it determine where the coding sequence (or ORF)
lies in a given DNA sequence when one runs the program without specifying
the -sbegin and -send flags?

i.e.

$cusp dna_seq  

How does cusp determine where the coding sequence begins?

As opposed to 

$cusp dna_seq -sbegin 135 -send 192

where the coding sequence is specified.  In the latter case, if the
length of the specified region is not divisible by 3, does cusp ignore
the last few nucleotides?
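
To make the second question concrete, here is a minimal Python sketch of counting codons in a 1-based, inclusive region. It does not describe what cusp actually does: the function name `count_codons` and the choice to report, rather than silently drop, an incomplete trailing codon are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Tuple


def count_codons(sequence: str, begin: int, end: int) -> Tuple[Counter, int]:
    """Count codons in sequence[begin..end] (1-based, inclusive).

    Returns the codon counts and the number of trailing bases that
    do not form a complete codon.
    """
    region = sequence[begin - 1:end].upper()
    leftover = len(region) % 3
    usable = len(region) - leftover
    codons = Counter(region[i:i + 3] for i in range(0, usable, 3))
    return codons, leftover


# A region from 135 to 192 holds 58 bases: 19 full codons plus 1 leftover base.
counts, leftover = count_codons("ATG" * 100, 135, 192)
print(sum(counts.values()), leftover)
```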


Thanks.

-Ted


=====================================
Ted Chiang, Analyst
Centre for Computational Biology 
Hospital for Sick Children, Toronto
416.813.7028
tchiang at bioinfo.sickkids.on.ca
=====================================


From sebastian.bassi at ar.advantaseeds.com  Tue Oct  8 16:05:43 2002
From: sebastian.bassi at ar.advantaseeds.com (Sebastian Bassi)
Date: Tue, 8 Oct 2002 22:05:43 +0200
Subject: fasta splitter
Message-ID: <BF7C473F341130418CAE514276C099DB06412A@e2knl1.nl.seedsnetwork.com>

> What did you have in mind as a naming convention for the 
> output files? The 
> existing code names each file after the first sequence, I 
> guess you want 
> "outfile.1" "outfile.2" and so on, possibly with leading zeroes 
> "outfile.001" etc.

My $.02: I think that outfile.[number_here] is not a good convention, since the extension (whatever you put after the dot) indicates the file type, and here the file type is always the same (ASCII text). I think it should be something like:
outfile_[number].txt
It should look like this:
outfile_1.txt
outfile_2.txt
outfile_3.txt

Anyway, IANAP (I am not a programmer); I'm just an end user, and I'm stating this from a user-consistency viewpoint. If I have two mp3 files (xsongpart1 and xsongpart2), I would name them part1.mp3 and part2.mp3 and NOT xsong.1 and xsong.2.


From mathog at mendel.bio.caltech.edu  Tue Oct  8 16:33:56 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 08 Oct 2002 13:33:56 -0700
Subject: fasta splitter
Message-ID: <E17z13E-0004aH-00@mendel.bio.caltech.edu>


> > What did you have in mind as a naming convention for the 
> > output files? The 
> > existing code names each file after the first sequence, I 
> > guess you want 
> > "outfile.1" "outfile.2" and so on, possibly with leading zeroes 
> > "outfile.001" etc.
> 
> My $.02: I think that outfile.[number_here] is not a good convention,
> since the extension (whatever you put after the dot) means the file
> type, and here the file type is always the same (ASCII text). I think it
> should be something like:
> outfile_[number].txt
> It should look like this:
> outfile_1.txt


I agree.  Also the numeric range should be displayed
in a fixed column width.  Ideally something like:

  % esplit \
     -sequence=ncbi_nr.nfa \
     -fmask='nr_frag_####.nfa' \
     -splitn=20 \
     -splitmode=cycle \
     -numberfrom=0

would produce

nr_frag_0000.nfa
...
nr_frag_0019.nfa

Keeping the names fixed width prevents all sorts of
text alignment problems which can show up otherwise.
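
The '####' mask idea can be sketched as follows. This is a hypothetical helper, not an existing EMBOSS option; the name `expand_mask` and the rule "the first run of '#' sets the field width" are assumptions.

```python
import re


def expand_mask(mask: str, number: int) -> str:
    """Replace the first run of '#' in mask with number, zero-padded to the run's width."""
    match = re.search(r"#+", mask)
    if match is None:
        raise ValueError("mask contains no '#' placeholder")
    width = len(match.group(0))
    return mask[:match.start()] + str(number).zfill(width) + mask[match.end():]


# Twenty fragments numbered from 0, as in the esplit example above.
names = [expand_mask("nr_frag_####.nfa", i) for i in range(20)]
print(names[0], names[-1])
```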

Regards,


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From fernan at iib.unsam.edu.ar  Tue Oct  8 18:51:16 2002
From: fernan at iib.unsam.edu.ar (Fernan Aguero)
Date: Tue, 8 Oct 2002 19:51:16 -0300
Subject: fasta splitter
In-Reply-To: <3DA309D0.7@uk.lionbioscience.com>
References: <Pine.OSF.4.44.0210081607100.919344-100000@cbi1a> <3DA309D0.7@uk.lionbioscience.com>
Message-ID: <20021008225116.GA273@iib.unsam.edu.ar>

+----[ Thus spoke Peter Rice (peter.rice at uk.lionbioscience.com):
|

[ snipped ]

| 
| One problem doing this in EMBOSS is the need to generate filenames for your 
|  split files - but maybe a base filename would be enough to generate 
| names. 

Now let me join the discussion. The splitter I use is
called 'shatter' and is part of the SEALS package, which I
guess is unmaintained (and perhaps obsolete?) and is
basically Perl.
ftp://ftp.ncbi.nih.gov/pub/walker/seals/software

The following discussion works for splitting into individual
sequences, but not into groups of sequences. In this case a
different naming scheme should be used, (though perhaps the
same argument specifier '-word' could be used?). 

The approach of shatter (both for splitting FASTA files and
for splitting concatenated BLAST reports, which are
split by 'shatterblast') is to let you choose the 'word'
which will be used as a basename. Both shatters know about
the NCBI FASTA standard and thus, given a FASTA header like
the following:
>gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc

will take the gi as word 2 (123456), the accession number
(AA123456) as word 4, the accession.version (AA123456.1) as
word 5 and so on. 

In the command-line you just say 'shatter -word 1 fastafile'
if you want the first word after the '>' to be the basename.

This produces files with that basename, ending in .fa

The program will consider whitespace and the character '|'
as word delimiters.
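
The word-selection rule described above can be sketched like this. It is not shatter's own code (shatter is Perl); the function name `header_word` is hypothetical, and treating a run of consecutive delimiters as a single delimiter is an assumption.

```python
import re


def header_word(header: str, word: int) -> str:
    """Return the 1-based word of a FASTA header, splitting on whitespace and '|'."""
    words = [w for w in re.split(r"[\s|]+", header.lstrip(">")) if w]
    return words[word - 1]


header = ">gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc"
# Word 2 is the gi, word 4 the accession, word 5 the accession.version.
print(header_word(header, 2) + ".fa")
```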

In my own experience this is a good thing. I've used shatter
with many different FASTA flavours and adjusting the word to
be used as basename is plain easy.

BLAST reports are also trivial since query sequences are
also usually in FASTA format, and you get basically the same
header, though after the 'Query=' magic word. In this case
you get files with the same basename, but ending in .br

Just my 2 cents. Hope this makes it into EMBOSS.

Fernan


| and change the output file. You can add a command line option for the 
| number of sequences in an output file. Cleaning up output files for a rerun 
| is an exercise for the user (unless you want to invent a new ACD type that 
| does it :-)
| 
| Needs a modified version of the seqFileReopen function to handle the file 
| naming, but nothing complicated is involved.
| 
| regards.
| 
| Peter
| 
| -- 
| ------------------------------------------------
| Peter Rice, LION Bioscience Ltd, Cambridge, UK
| peter.rice at uk.lionbioscience.com +44 1223 224723
| 
| 
|
+----]

-- 
F e r n a n   A g u e r o
http://genoma.unsam.edu.ar/~fernan


From gwilliam at hgmp.mrc.ac.uk  Wed Oct  9 04:28:08 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Wed, 09 Oct 2002 09:28:08 +0100
Subject: eprimer3...broken pipe?
References: <B9C89A33.33C4%squiresb@macrogenics.com>
Message-ID: <3DA3E898.C9318CEE@hgmp.mrc.ac.uk>

From the eprimer3 documentation:

The Whitehead program must be set up and on the path in order for
eprimer3 to find and run it. 

The Whitehead Institute program that is run by this program is available
from:
http://www-genome.wi.mit.edu/genome_software/other/primer3.html 
(Then see the link 'Get release 0.9') 

The version that is run by this program is 3.0.9 currently available
from:
http://www-genome.wi.mit.edu/ftp/distribution/software/primer3_0_9_test.tar.gz 
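
A quick way to check whether primer3_core is visible on the path before running eprimer3 is sketched below in Python. It uses only the standard library; the wrapper name `find_program` is hypothetical.

```python
import shutil
from typing import Optional


def find_program(name: str, path: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name, path=path)


location = find_program("primer3_core")
if location is None:
    print("primer3_core is not on the PATH; eprimer3 will not be able to run it")
else:
    print(f"primer3_core found at {location}")
```

If it reports that the program is missing, add the directory holding primer3_core to PATH in the shell that launches eprimer3.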

Gary


Burke Squires wrote:
> 
> I have tried to install various versions of EMBOSS, and when I try to run
> eprimer3 I get the following message:
> 
> [loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
> Picks PCR primers and hybridization oligos
> Input sequence(s): /bioinfo/fragments.fa
> Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out
> 
>    EMBOSS An error in eprimer3.c at line 317:
> The program 'primer3_core' must be on the path.
> It is part of the 'primer3' package,
> available from the Whitehead Institute.
> See: http://www-genome.wi.mit.edu/
> Broken pipe
> 
> Does anybody know how to fix this?
> 
> Thanks!
> 
> Burke Squires
> 

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From Joerg.Schaber at uv.es  Wed Oct  9 13:02:51 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Wed, 09 Oct 2002 19:02:51 +0200
Subject: swissprot
Message-ID: <3DA4613B.3010901@uv.es>

Hi,

I can't load the SWISSPROT bacteria database 
(ftp://ftp.ebi.ac.uk/pub/databases/swissprot/special_selections/bacteria.seq) 
into EMBOSS. I think EMBOSS is running well because I have no problem 
accessing the test databases (see showdb below). However, I think 
seqret is somehow using the wrong division file, although the PATH settings 
seem to be correct.

greetings,

joerg
 

 > dbiflat
Index a flat file database
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: SWISS
Database directory [.]:
Wildcard database filename [*.dat]: *.seq
Database name: swissbac
Release number [0.0]: 1.0
Index date [00/00/00]: 09/10/02

 > ll
insgesamt 132100
 950883 drwxrwxr-x    2 root     users        4096 Okt  9 18:50 .
 623533 drwxrwxr-x    5 jos      jos          4096 Okt  9 17:50 ..
 950889 -rw-r--r--    1 jos      jos        189028 Okt  9 18:50 acnum.hit
 950888 -rw-r--r--    1 jos      jos        660456 Okt  9 18:50 acnum.trg
 623548 -rw-r--r--    1 jos      jos      133412511 Okt  9 18:25 bacteria.seq
 950886 -rw-r--r--    1 jos      jos           322 Okt  9 18:50 division.lkp
 950887 -rw-r--r--    1 jos      jos        836840 Okt  9 18:50 entrynam.idx

 > showdb
Displays information on the currently available databases
# Name        Type ID  Qry All Comment
# ====        ==== ==  === === =======
swissbac      P    OK  OK  OK  SWISSPROT sequences of procaryotes 9/10/02
tpir          P    OK  OK  OK  PIR using NBRF access for 4 files
tsw           P    OK  OK  OK  Swissprot native format with EMBL CD-ROM index
tswnew        P    OK  OK  OK  Swissnew as 3 files in native format with EMBL CD-ROM index
twp           P    OK  OK  OK  EMBL new in native format with EMBL CD-ROM index
buch          N    OK  OK  OK  Buchnera database in DDBJ Format
fbuch         N    OK  OK  OK  Buchnera database in FASTA Format
tembl         N    OK  OK  OK  EMBL in native format with EMBL CD-ROM index
tgb           N    OK  -   -   Genbank IDs
tgenbank      N    OK  OK  OK  GenBank in native format with EMBL CD-ROM index

 > head bacteria.seq
ID   120K_RICRI     STANDARD;      PRT;  1300 AA.
AC   P14914;
--snipp

--snipp

 > seqret swissbac:120K_RICRI
Reads and writes (returns) sequences
Warning: Cannot open division file '<null>' for database 'swissbac'
Warning: seqCdQry failed
Error: Unable to read sequence 'swissbac:120K_RICRI'
 >

-- 
----------------------------------------------------------
Joerg Schaber
Instituto Cavanilles de Biodiversidad y Genetica Evolutiva
Universidad de Valencia               Tel.: ++34 96 398 3647
A.C. 22085                            Fax.: ++34 96 398 3670
46071 Valencia, España                email : jos at uv.es


From jweiner1 at ix.urz.uni-heidelberg.de  Thu Oct 10 05:17:51 2002
From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3)
Date: Thu, 10 Oct 2002 11:17:51 +0200 (METDST)
Subject: fasta splitter
In-Reply-To: <000d01c26ef9$f206f710$0a00a8c0@zeus>
Message-ID: <Pine.A41.4.42.0210101115070.37122-100000@aixterm1.urz.uni-heidelberg.de>

> This sounds great to me. Personally, I'd prefer not to have the leading
> zeros - just an incrementing ".[integer]" appended to the filename supplied.
> Makes shell manipulation easier.

Well, I'd prefer the former -- because it makes shell manipulation easier
:-)  If you stay with the leading 0's, then any listing will show the files
in the correct order, otherwise it will show "foo.1, foo.10, ...,
foo.100,...  foo.2, ..." etc.
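
The ordering problem is easy to reproduce. The short Python sketch below sorts unpadded and zero-padded names the way a plain directory listing would (as strings), and also shows a numeric sort key as a workaround for unpadded names; the file names are made up.

```python
numbers = (1, 2, 10, 100)
plain = [f"foo.{i}" for i in numbers]
padded = [f"foo.{i:03d}" for i in numbers]

# String sorting, as done by a plain directory listing.
print(sorted(plain))
print(sorted(padded))

# Unpadded names can still be ordered numerically with an explicit key.
by_number = sorted(plain, key=lambda name: int(name.rsplit(".", 1)[1]))
print(by_number)
```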

j.


----)-\//-///-----------------------------------January-Weiner-3-------
"'Tis true, there's magic in the web of it." -- Shakespeare


From kenneth at geisshirt.dk  Mon Oct 14 07:02:15 2002
From: kenneth at geisshirt.dk (Kenneth Geisshirt)
Date: Mon, 14 Oct 2002 13:02:15 +0200 (CEST)
Subject: Splitting genbank
Message-ID: <Pine.LNX.4.44.0210141258100.361-100000@lithium>

Hi everyone

I recently joined the mailing list (after a couple of weeks usage of
EMBOSS) so I hope that my question isn't a FAQ.

I have a local copy of GenBank, and I wish to split it into four
databases: one for human, one for rat, one for mouse and one for the
rest. The applications seqret and seqretsplit can help me with the first
three by specifying the organism in the USA, but how do I specify "not
human and not rat and not mouse"?

Thanks in advance
  Kneth

-- 
Kenneth Geisshirt, M.Sc., Ph.D.         http://kenneth.geisshirt.dk
Grøndals Parkvej 2A, 3. sal                    kenneth at geisshirt.dk
DK-2720 Vanløse                                     +45 38 87 78 38


From peter.rice at uk.lionbioscience.com  Mon Oct 14 07:27:34 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 12:27:34 +0100
Subject: Splitting genbank
References: <Pine.LNX.4.44.0210141258100.361-100000@lithium>
Message-ID: <3DAAAA26.1080707@uk.lionbioscience.com>

Kenneth Geisshirt wrote:
> I have a local copy of GenBank, and I wish to split it into four
> databases: one for human, one for rat, one for mouse and one for the
> rest. The applications seqret and seqretsplit can help me with the first
> three by specifying the organism in the USA, but how do I specify "not
> human and not rat and not mouse"?

In EMBOSS ....

split the gbrod file into rat, mouse and other rodents (a simple perl 
script would do)

index and define GenBank

then define subsets using the same index files and exclude the ones you 
don't want using, for example:

exclude: "*pri* *rat* *mus*"

... in copies of your EMBOSS database definition for genbank.

EMBOSS simply checks the excluded files list when using the index files.
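
The effect of an exclude list can be pictured with a small Python sketch. It is not EMBOSS code: the function name `apply_exclude` and the file names are hypothetical, and it only assumes that the patterns are shell-style wildcards matched against data file names.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def apply_exclude(filenames: Iterable[str], exclude: str) -> List[str]:
    """Keep the files that match none of the space-separated wildcard patterns."""
    patterns = exclude.split()
    return [f for f in filenames
            if not any(fnmatchcase(f, p) for p in patterns)]


# Hypothetical file names after splitting gbrod into rat, mouse and other rodents.
files = ["gbpri1.seq", "gbrod_rat.seq", "gbrod_mus.seq",
         "gbrod_other.seq", "gbbct1.seq"]
rest = apply_exclude(files, "*pri* *rat* *mus*")
print(rest)
```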

regards,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From Joerg.Schaber at uv.es  Mon Oct 14 08:11:59 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Mon, 14 Oct 2002 14:11:59 +0200
Subject: other indices
Message-ID: <3DAAB48F.6080704@uv.es>

Hi,

dbiflat can index fields other than ID and accession number, such as 
sequence version (seqv), description (des), keywords and taxon. However, 
in the example databases that come with EMBOSS I found only field 
definitions like 'fields: "sv des org key"'. So do I access the 
additional indices (e.g. in seqret) via 'seqret-sv:\*', 
'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively? 
'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*'  did not work.

Greetings,

joerg


From gwilliam at hgmp.mrc.ac.uk  Mon Oct 14 08:18:20 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Mon, 14 Oct 2002 13:18:20 +0100
Subject: other indices
References: <3DAAB48F.6080704@uv.es>
Message-ID: <3DAAB60C.ACF4111E@hgmp.mrc.ac.uk>

See:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/UniformSequenceAddress.html#keys

You append the 'sv', 'des', 'org', 'key', etc to the database name with
a '-' and to a file name with a ':', so:

with a database you use a command like:

seqret embl-des:fau


with a file you use a command like:

seqret filename:org:homo


Gary

Joerg Schaber wrote:
> 
> Hi,
> 
> dbiflat can index fields other than ID and accession number, such as
> sequence version (seqv), description (des), keywords and taxon. However,
> in the example databases that come with EMBOSS I found only field
> definitions like 'fields: "sv des org key"'. So do I access the
> additional indices (e.g. in seqret) via 'seqret-sv:\*',
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively?
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*'  did not work.
> 
> Greetings,
> 
> joerg

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From peter.rice at uk.lionbioscience.com  Mon Oct 14 08:20:40 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 13:20:40 +0100
Subject: other indices
References: <3DAAB48F.6080704@uv.es>
Message-ID: <3DAAB698.1080108@uk.lionbioscience.com>

Joerg Schaber wrote:

> dbiflat can index fields other than ID and accession number, such as 
> sequence version (seqv), description (des), keywords and taxon. However, 
> in the example databases that come with EMBOSS I found only field 
> definitions like 'fields: "sv des org key"'. So do I access the 
> additional indices (e.g. in seqret) via 'seqret-sv:\*', 
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively? 
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*'  did not work.

For a database called schaber

dbiflat -fields "acnum,seqvn,des,keyword,taxon"

In the emboss.default definition:

DB schaber  [ type: P format: swiss method: emblcd
   dir: /data/schaber
   indexdir: /data/schaber
   comment: "Flatfiles database, all fields indexed"
   fields: "sv des org key"
]

In EMBOSS programs, use the USA:

'schaber-sv:\*'
'schaber-des:\*'
'schaber-org:\*'
'schaber-key:\*'

The confusion comes because the database definition (and the USA syntax) 
uses the field names in common use (e.g. in SRS) and dbiflat uses the 
EMBLCD/Staden index file names that dbiflat will be writing.
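
The two naming schemes can be lined up in a small Python sketch. The pairing below is read off the dbiflat command line and the 'fields:' line shown above; the dictionary and the function name `usa_for` are hypothetical helpers, not part of EMBOSS.

```python
# dbiflat index name -> field name used in emboss.default and in a USA
DBIFLAT_TO_USA = {
    "seqvn": "sv",
    "des": "des",
    "taxon": "org",
    "keyword": "key",
}


def usa_for(database: str, dbiflat_field: str, query: str = "*") -> str:
    """Build a USA such as 'schaber-org:*' from a dbiflat field name."""
    return f"{database}-{DBIFLAT_TO_USA[dbiflat_field]}:{query}"


for field in ("seqvn", "des", "taxon", "keyword"):
    print(usa_for("schaber", field))
```

On a Unix command line the '*' still has to be quoted or escaped, as in the examples above.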

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From jmuehlis at uni-muenster.de  Tue Oct 15 05:03:19 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:03:19 +0200
Subject: this format is not readable by seqret
Message-ID: <3DABD9D7.4101AA0D@uni-muenster.de>

Hello there,

my name is Jörg Mühlisch and I work in the Department of Pediatric
Hematology and Oncology at the University of Münster (Germany). As a
scientist I use EMBOSS on Linux.

So here is my first question:

I have a set of sequences in different formats. Before trying to index
them, I tested them for readability with seqret:

find ./ -name "*" -exec seqret -osf fasta {} ../Sequencesothers/{} \;

Some of my files are not readable and I do not know the name of their
format:

Contig 1 (1,506)
  Contig Length:                  506 bases
  Average Length/Sequence:        458 bases
  Total Sequence Length:         1375 bases
  Top Strand:                       3 sequences
  Bottom Strand:                    0 sequences
  Total:                            3 sequences
^^
AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...

Maybe there is a way to convert this format appropriately.

Thanks

Jörg Mühlisch

From peter.rice at uk.lionbioscience.com  Tue Oct 15 05:16:32 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 15 Oct 2002 10:16:32 +0100
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
Message-ID: <3DABDCF0.6090809@uk.lionbioscience.com>

Joerg Muehlisch wrote:

> Some of my files are not readable and I do not know the name of their
> format:
> 
> Contig 1 (1,506)
>   Contig Length:                  506 bases
>   Average Length/Sequence:        458 bases
>   Total Sequence Length:         1375 bases
>   Top Strand:                       3 sequences
>   Bottom Strand:                    0 sequences
>   Total:                            3 sequences
> ^^
> AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...
> 
> Maybe there is a way to convert this format appropriately.

Should be possible, if the format is common enough.

Where does the file come from? Does this program/package have an option to 
save in one of the (many) 'standard' formats?

regards,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From jmuehlis at uni-muenster.de  Tue Oct 15 05:44:28 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:44:28 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com>
Message-ID: <3DABE37C.B045FF6E@uni-muenster.de>

Hi,

in fact I hoped that somebody on the list would know where this format
comes from. In my set of files I found just a few of these unreadable
sequences.
As it does not seem to be a well-known format, I will try to find out
where it is used.

Thanks

Jorg

Peter Rice wrote:

> Should be possible, if the format is common enough.
> 
> Where does the file come from? Does this program/package have an option to
> save in one of the (many) 'standard' formats?
> 
> regards,
> 
> Peter Rice
> 
> --
> ------------------------------------------------
> Peter Rice, LION Bioscience Ltd, Cambridge, UK
> peter.rice at uk.lionbioscience.com +44 1223 224723

From kdj at sanger.ac.uk  Tue Oct 15 06:38:46 2002
From: kdj at sanger.ac.uk (Keith James)
Date: 15 Oct 2002 11:38:46 +0100
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de>
	<3DABDCF0.6090809@uk.lionbioscience.com>
	<3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <sc4u1joasw9.fsf@deskpro69.internal.sanger.ac.uk>

>>>>> "Joerg" == Joerg Muehlisch <jmuehlis at uni-muenster.de> writes:

    Joerg> Hi, in fact I hoped that somebody on the list would know
    Joerg> where this format comes from. In my set of files I found
    Joerg> just a few of these unreadable sequences.  As it does not
    Joerg> seem to be a well-known format, I will try to find out
    Joerg> where it is used.

I _think_ this may be flatfile output from DNAStar/Lasergene. It's
been a while since I've seen any files like that but the ^^ delimiter
reminded me of it.

I don't have access to the package to verify this.

Keith

-- 

- Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -


From jrvalverde at cnb.uam.es  Tue Oct 15 08:09:12 2002
From: jrvalverde at cnb.uam.es (José R. Valverde)
Date: Tue, 15 Oct 2002 14:09:12 +0200
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de>
	<3DABDCF0.6090809@uk.lionbioscience.com>
	<3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <20021015140912.7294cd80.jrvalverde@cnb.uam.es>

On Tue, 15 Oct 2002 11:44:28 +0200
Joerg Muehlisch <jmuehlis at uni-muenster.de> wrote:

> Hi,
> 
> in fact I hoped that somebody on the list would know where this format
> comes from. In my set of files I found just a few of these unreadable
> sequences.
> As it does not seem to be a well-known format, I will try to find out
> where it is used.
> 
Maybe it would help if you were able to post a full sample file.
From the fragments you posted, it looks like a sequencing project
file. It mentions a contig size, with many gel readings of a given
average length, and the orientation coverage of the gels (+/- strands).

Iff the sequence contained (you only included a few bases) is just
the consensus, i.e. a single sequence of length exactly equal to the
consensus length, then conversion to any format should be trivial.
Simply do a 'tail -n +8 {}'.

Otherwise it might contain the gel readings (and the consensus?),
and then it would be a multiple sequence file, possibly with gel
overlaps et al. and conversion may be a bit more difficult. It may
be also that more than one contig and associated files is included in
one file, making processing more difficult.

Initially I would expect the second choice to be true, from the header:
several short sequences making up a contig plus the consensus, in your
example, the first contig would be 506 bases, composed of three gels
of average length 458. Since 1375/3 = 458, I deduce that the consensus
sequence is not included. Therefore you have a multiple sequence file
of overlapping gel readings.

You may try this:

	1) find out if more than one contig is in the file
	2) find out how sequences are separated
	3) decide what you want to do with them, e.g.
		split the file at "^Contig " lines
		strip comment lines (^*:*$)
		split at sequence separators

see csplit(1) for details on how to do it in a pipeline. E.g.
assuming sequences are delimited by a blank line, this _might_
work (csh syntax; the '{*}' repeat count is a GNU csplit extension):
	csplit -f contig. file '/^Contig /' '{*}'
	foreach i ( contig.* )
		tail -n +8 $i | csplit -f ${i}.gel. - '/^$/' '{*}'
	end
(note that the patterns are quoted to protect them from the shell) and
you'd get the raw sequences all right as contig.##.gel.##
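
For readers without csplit, the same steps can be sketched in Python. This is only an illustration under the assumptions already stated in this message: each contig starts at a line beginning with "Contig ", the first seven lines of each contig are header (the part that 'tail +8' skips), and gel readings are separated by blank lines. The function name split_contigs and its header_lines parameter are hypothetical, and the real file layout was not confirmed at this point in the thread.

```python
def split_contigs(text, header_lines=7):
    """Return {contig_index: [gel_sequence, ...]} from the raw file text.

    Assumes (not confirmed by the thread) that every contig starts with a
    line beginning "Contig ", that the first `header_lines` lines of each
    contig are header, and that blank lines separate the gel readings.
    """
    contigs = []
    current = None
    for line in text.splitlines():
        if line.startswith("Contig "):
            # A new contig begins here; start collecting its lines.
            current = []
            contigs.append(current)
        if current is not None:
            current.append(line)

    result = {}
    for index, lines in enumerate(contigs):
        gels, gel = [], []
        # Skip the header, like 'tail +8' does in the shell version.
        for line in lines[header_lines:]:
            if line.strip():
                gel.append(line.strip())
            elif gel:
                # A blank line closes the current gel reading.
                gels.append("".join(gel))
                gel = []
        if gel:
            gels.append("".join(gel))
        result[index] = gels
    return result
```

Calling split_contigs(open("file").read()) would give one list of raw gel sequences per contig, which can then be written out in whatever format is needed.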

				j


From jmuehlis at uni-muenster.de  Tue Oct 15 09:50:03 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 15:50:03 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
		<3DABDCF0.6090809@uk.lionbioscience.com>
		<3DABE37C.B045FF6E@uni-muenster.de> <sc4u1joasw9.fsf@deskpro69.internal.sanger.ac.uk>
Message-ID: <3DAC1D0B.3ACD64FD@uni-muenster.de>

Keith James wrote:
Yes, I think that might be it; I think our collaboration group is working
with DNAStar. But nevertheless there does not seem to be an EMBOSS way
to change the file format. So I will try it with Linux tools like tr.
Thanks for your help.

Jorg
> I _think_ this may be flatfile output from DNAStar/Lasergene. It's
> been a while since I've seen any files like that but the ^^ delimiter
> reminded me of it.
> 
> I don't have access to the package to verify this.
> 
> Keith
> 
> --
> 
> - Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
> - Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -

From gbottu at ben.vub.ac.be  Mon Oct 21 04:34:38 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 10:34:38 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210834.KAA1459646@black.vub.ac.be>

from : BEN

	Dear colleagues,
	
While doing some experimenting with fuzzpro, I tried the following :
-----------------
Input sequence(s): sw:pap?_carpa
Search pattern: <M(0,1)-A.
Number of mismatches [0]:
Output report [pap2_carpa.fuzzpro]:

   EMBOSS An error in embpat.c at line 725:
Unrecognised character in <M(0,1)-A
------------------
Yet I think I respected the PROSITE syntax. Does anyone have an idea?

	Guy Bottu


From gbottu at ben.vub.ac.be  Mon Oct 21 04:34:46 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 10:34:46 +0200 (CEST)
Subject: question about prophecy
Message-ID: <200210210834.KAA1459584@black.vub.ac.be>

from : BEN

	Dear colleagues,
	
I was looking at what the program prophecy is doing and I am puzzled. What is 
the difference between Gribskov and Henikoff profiles? Both seem to have 
match/mismatch scores computed with the help of a scoring matrix as well as gap 
penalties. Furthermore, I thought that the Henikoffs made the Blocks databank 
using profiles without gaps. Can someone help me?

	Guy Bottu


From ableasby at hgmp.mrc.ac.uk  Mon Oct 21 05:11:57 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 21 Oct 2002 10:11:57 +0100 (BST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210911.KAA28060@bromine.hgmp.mrc.ac.uk>

Terminating full-stops are currently not part of the EMBOSS 
implementation of PROSITE patterns. Strictly they are,
although unnecessary, part of the PROSITE syntax so we
can accept them for future releases. For now if you just
omit the '.' the pattern will work.

Alan


From gbottu at ben.vub.ac.be  Mon Oct 21 05:36:05 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 11:36:05 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210936.LAA1462808@black.vub.ac.be>

Without the '.' it does not give an error. I get : 
------------------
> fuzzpro
Protein pattern search
Input sequence(s): sw:pap?_carpa
Search pattern: <M(0,1)-A
Number of mismatches [0]:
Output report [pap2_carpa.fuzzpro]:
-------------------
the output file however turns out to be empty. Yet it should have found sw:papa_carpa, which 
starts with :   MAMI...

	Guy Bottu
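
To make the expected behaviour concrete, here is a minimal Python sketch that translates a small subset of PROSITE pattern syntax into a regular expression. It is not how EMBOSS (embpat.c) implements patterns; the function name prosite_to_regex is hypothetical, and only single residues, x, [..], {..}, (n) / (n,m) repeats, the < and > anchors and an optional trailing '.' are handled.

```python
import re

# One PROSITE element: x, a residue, [ABC] or {ABC}, optionally followed
# by a repeat count such as (2) or (0,1).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE pattern into a Python regex string."""
    pattern = pattern.strip()
    if pattern.endswith("."):
        # The terminating full stop carries no meaning; drop it.
        pattern = pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high = match.groups()
        if residue == "x":
            core = "."                                  # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"           # none of these
        else:
            core = residue                              # residue or [..]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)
```

With this sketch, prosite_to_regex("<M(0,1)-A.") gives "^M{0,1}A", which matches a sequence starting with MAMI, so a hit at the start of sw:papa_carpa is indeed what one would expect from the pattern.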


From ableasby at hgmp.mrc.ac.uk  Mon Oct 21 05:50:21 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 21 Oct 2002 10:50:21 +0100 (BST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210950.g9L9oLt03933@sulphur.hgmp.mrc.ac.uk>

We'll look into that. Looks to be a boundary condition
affecting zero length N terminal ranges.

Thanks

Alan


From simon.andrews at bbsrc.ac.uk  Mon Oct 21 10:00:35 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 15:00:35 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28753@bi-exsrv1.iapc.bbsrc.ac.uk>

I'm having all sorts of problems working with the latest release of RefSeq, due to a change in the way the files are being laid out.

In older releases of RefSeq the LOCUS identifier was the same as the accession number (eg NM_0123456), but in the latest version the LOCUS identifier is the gene identifier, and these aren't unique in the database!!

This means that when I run dbiflat (even using -idformat REFSEQ) I get a load of warnings about duplicate entries and when I later try to use the database I find that a load of entries are inaccessible because of this.

For example accessions NM_134265,NM_134264 and NM_015626 all have the ID WSB1.

How can I get dbiflat to index with the accession number as its primary identifier so I don't lose entries when indexing them?
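
To see how many entries are affected before indexing, the duplicated LOCUS names can be listed with a few lines of Python. This is a rough sketch, assuming a GenBank-style flatfile in which every entry has a LOCUS line followed later by an ACCESSION line; the function name duplicate_locus_names is hypothetical and the script does not touch dbiflat at all.

```python
from collections import defaultdict


def duplicate_locus_names(lines):
    """Map each LOCUS name used more than once to the accessions sharing it.

    `lines` is any iterable of text lines, e.g. an open flatfile.
    """
    by_locus = defaultdict(list)
    locus = None
    for line in lines:
        fields = line.split()
        if line.startswith("LOCUS") and len(fields) > 1:
            # Remember the entry name until its ACCESSION line turns up.
            locus = fields[1]
        elif line.startswith("ACCESSION") and len(fields) > 1 and locus is not None:
            by_locus[locus].append(fields[1])
            locus = None
    return {name: accs for name, accs in by_locus.items() if len(accs) > 1}
```

Running it as duplicate_locus_names(open("refseq.gbff")) (the file name here is only a placeholder) returns a dictionary of the clashing names, so the WSB1 case above would show up with its three accessions.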

Thanks

Simon

PS This actually looks like a mistake by the RefSeq curators - I mean who thought that having a non-unique primary sequence identifier was a good idea!!!

--
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463 


From simon.andrews at bbsrc.ac.uk  Mon Oct 21 11:24:39 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 16:24:39 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28754@bi-exsrv1.iapc.bbsrc.ac.uk>

> -----Original Message-----
> From: simon andrews (BI) [mailto:simon.andrews at bbsrc.ac.uk]
> Subject: Indexing Refseq
> 
> 
> I'm having all sorts of problems working with the latest 
> release of RefSeq
>
> This means that when I run dbiflat (even using -idformat 
> REFSEQ) I get a load of warnings about duplicate entries and 
> when I later try to use the database I find that a load of 
> entries are inaccessible because of this.
> 
> For example accessions NM_134265,NM_134264 and NM_015626 all 
> have the ID WSB1.

Just to follow up on my own message: I've found a temporary workaround for this problem.  The Bioperl script at the bottom of this message pre-processes the current RefSeq files into a format which dbiflat can then index without errors.  You will see a warning from the NC_xxxx chromosome files in RefSeq, but as these are only features with no sequence I wasn't too worried about them and just skipped them.

Usage of the script is "script_name [infile] > outfile".

	TTFN

	Simon.
-------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# This script is a filter through which we can
# pass the whole of refseq. Newer versions of
# refseq replaced their locus ID with a string
# which wasn't the accession number.  This
# just changes them back.

my ($filename) = @ARGV;

die "No filename given" unless ($filename);

my $in = Bio::SeqIO -> new(-file => $filename,
			      -format => 'genbank');

die "Couldn't read $filename" unless ($in);

my $out = Bio::SeqIO -> new(-fh => \*STDOUT,
			    -format => 'genbank');

die "Couldn't make output pipe" unless ($out);

while (my $seq = $in -> next_seq()){

  # Some NC_xxx seqs are in the Refseq file
  # but don't have any sequence attached. We'll
  # skip those files...

  next if ($seq -> accession =~ /^NC/);

  $seq -> display_id($seq-> accession());

  $out -> write_seq($seq);

}
#-------------------------------------------------------


From jmuehlis at uni-muenster.de  Tue Oct 22 04:06:55 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 22 Oct 2002 10:06:55 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
			<3DABDCF0.6090809@uk.lionbioscience.com>
			<3DABE37C.B045FF6E@uni-muenster.de> <sc4u1joasw9.fsf@deskpro69.internal.sanger.ac.uk> <3DAC1D0B.3ACD64FD@uni-muenster.de>
Message-ID: <3DB5071F.EEB21A1@uni-muenster.de>

Hi,

Just for your information. This is the answer from my collaborators:

The sequence is a DNAStar EditSeq file.  The notation indicates that
this sequence is a consensus sequence from multiple reads put into a
contig.  If you do not have DNAStar, try to open it with a wordprocessor
program and cut and paste the sequence into whatever sequence editor
you use.  The sequence uses standard nomenclature (i.e. W = A or T;
M = A or C; etc.)

Thanks for your help.

As this format is not readable I will now just change the format by
other means.

Jorg

From Andres.Aeschlimann at id.unibe.ch  Tue Oct 22 11:23:31 2002
From: Andres.Aeschlimann at id.unibe.ch (Andres Aeschlimann)
Date: Tue, 22 Oct 2002 17:23:31 +0200 (MET DST)
Subject: Cannot connect!
Message-ID: <Pine.GSO.4.21.0210221655360.14608-100000@ubecx01>


Hi all

I have installed Jemboss for the first time, but there's still a
problem left:

After launching Jemboss from
http://ubecx04.unibe.ch:8080/jemboss/Jemboss.jnlp (a trial campus EMBOSS
server)

the webstart window appears as it should, and the login window as well, 
where username and password can be entered. Later on the window says
"Cannot connect!" and a window titled "Check Public Server Settings"
appears with the contents of the jemboss.properties file:

user.auth=true
jemboss.server=true
server.public=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
server.private=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
service.public=JembossAuthServer
service.private=JembossAuthServer
plplot=/products/emboss/emboss/share/EMBOSS/
embossData=/products/emboss/emboss/share/EMBOSS/data/
embossBin=/products/emboss/emboss/bin/
embossPath=/usr/bin/:/bin:/packages/clustal/:/packages/primer3/bin:
acdDirToParse=/products/emboss/emboss/share/EMBOSS/acd/
embossURL=http://www.uk.embnet.org/Software/EMBOSS/Apps/

soap-2_3_1 and jakarta-tomcat-4.1.12 are installed as described, for
use with
ftp://ftp.hgmp.mrc.ac.uk/pub/EMBOSS/patchfiles/install-jemboss-server.sh

rpcrouter listens on https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
and responds:

SOAP RPC Router

Sorry, I don't speak via HTTP GET- you have to use HTTP POST to talk to me.


ubecx04:/products/emboss.222 % java -version
java version "1.4.0_00"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_00-b05)
Java HotSpot(TM) Client VM (build 1.4.0_00-b05, mixed mode)

on Solaris 9.

Is there any log file where the cause would be explained? 

Thanks in advance for any hint.

Res
=========================================================
Dr. Andres Aeschlimann     Andres.Aeschlimann at id.unibe.ch
University of Berne
Gesellschaftsstrasse 6
CH-3012 BERNE              tel: +41 31 631 3845
Switzerland                fax: +41 31 631 3865


From gbottu at ben.vub.ac.be  Thu Oct 24 10:20:43 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 24 Oct 2002 16:20:43 +0200 (CEST)
Subject: questions about codon usage tables
Message-ID: <200210241420.QAA1196695@black.vub.ac.be>

from : BEN

	Dear colleagues,
	
I just took a look at codon usage tables under EMBOSS.

- there is a list of tables in .../share/EMBOSS/data/CODONS. Unfortunately, they 
have rather cryptic names. Is there a way to find out for which organism they 
are ? And from which data source do they come ?

- there is a program cutgextract. I tried it :

> cutgextract
Extract data from CUTG
CUTG directory [.]: /db/cutg	(here is the file cutg.dat)

But it does ... nothing. 

Does anyone have a clue?

	Sincerely,
	Guy Bottu


From areagp61 at yahoo.it  Fri Oct 25 05:03:36 2002
From: areagp61 at yahoo.it (Graziano P.)
Date: Fri, 25 Oct 2002 11:03:36 +0200
Subject: -filter option for water and stretcher
Message-ID: <001e01c27c05$690c7520$18105709@italy.ibm.com>

Hi All,
I need to supply sequences on standard input. I have found the -filter
qualifier in the -help -verbose output. For example, to use this
qualifier with "transeq" I write:
transeq -filter

then I insert my sequence (in fasta format, for example) by pasting or
typing it. When I have finished, I press CTRL-D to end the standard
input. Finally the program writes its result to standard output.

I have tried to use the -filter qualifier with "water" and "stretcher".
These two programs require two input sequences, in different files.
If I write as standard input:

>HTRE_ECOLI P33129 OUTER MEMBRANE USHER PROTEIN ...
PGVYDVSVYVNDQPIINQSITFVAIEGKKNAQACITLKNLLQFHINSPDINNEKAVLLAR
DETLGNCLNLTEIIPQASVRYDVNDQRLDIDVPQAWVMKNYQNYVDPSLWENGINAAMLS
NDQRLDIDVP

>YCJV_ECOLI P77481 HYPOTHETICAL ABC TRANSPORTER ...
MAQLSLQHIQKIYDNQVHVVKDFNLEIADKEFIVFVGPSGCGKSTTLRMIAGLEEISGGD
LLIDGKRMNDVPAKARNIAMVFQNYALYPHMTVYDNMAFGLKMQKIAKEVIDERVNWAAQ
KISVAELTGAEFMLYTTVGGTS

when I press CTRL-D I get the following error message:

Error: Unable to read sequence ''

How can I tell the program that what I paste or type on standard input
is two different sequences?
Is there a separator character that does this?
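
The thread does not give a separator character. One possible workaround, sketched below in Python, is to split the pasted text into its two FASTA records, write each to its own file and pass the files to water by name. This is only an assumption-laden illustration: the function names and the a.fasta / b.fasta / out.water file names are hypothetical, the command is built but not run, and it relies on water's -asequence, -bsequence, -gapopen, -gapextend and -outfile qualifiers.

```python
import os


def split_fasta(text):
    """Split pasted FASTA text into a list of records (lists of lines)."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            current = [line]
            records.append(current)
        elif line.strip() and current is not None:
            current.append(line.strip())
    return records


def build_water_command(text, workdir):
    """Write the two records to files in workdir and return a water command."""
    records = split_fasta(text)
    if len(records) != 2:
        raise ValueError("expected exactly two sequences, got %d" % len(records))
    paths = []
    for name, record in zip(("a.fasta", "b.fasta"), records):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write("\n".join(record) + "\n")
        paths.append(path)
    # Gap penalties are given explicitly so that water has nothing to prompt for.
    return [
        "water",
        "-asequence", paths[0],
        "-bsequence", paths[1],
        "-gapopen", "10",
        "-gapextend", "0.5",
        "-outfile", os.path.join(workdir, "out.water"),
    ]
```

The returned list can be handed to subprocess.run() on a machine where EMBOSS is installed; the same approach should apply to stretcher, whose qualifiers should be checked with -help first.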

Best regards
Graziano



From aralp001 at udcf.gla.ac.uk  Fri Oct 25 11:04:22 2002
From: aralp001 at udcf.gla.ac.uk (Dr Adam Ralph)
Date: Fri, 25 Oct 2002 16:04:22 +0100 (BST)
Subject: multi-page graphical output
In-Reply-To: <3DA4613B.3010901@uv.es>
Message-ID: <Pine.SOL.4.10.10210251549490.1853-100000@lenzie.cent.gla.ac.uk>


Dear Anyone,

   I am trying to write a program which outputs a graph, similar to 
plotcon or cpgplot. It would appear that the way these programs are
constructed, the graph is plotted on one page. Thus if you have a large
sequence the graph looks a bit of a mess. Other types of graphical program 
(like prettyplot) which plot lines of text are able to alter the number of
characters per line and produce multiple pages.
   My question is can someone show me or give me an example program
which splits histogram/graph plots into multiple pages? Thus on one 
page you can have a graph of residues 1-1000, then graph of 1001-2000 etc.

Thanks in advance
Adam


Dr. Adam Ralph
Institute of Virology
University of Glasgow
Church Street
Glasgow
G11 5JR

Phone: 0141 330 6268
Fax:   0141 337 2236 
email: a.ralph at vir.gla.ac.uk


From ggaz at cpqrr.fiocruz.br  Wed Oct  9 17:19:56 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:19:56 -0300
Subject: jemboss
Message-ID: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>

I would like to use the jemboss program but I need to enroll in HGMP and I don't know how to do this. 
Could you help me?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ


From ggaz at cpqrr.fiocruz.br  Wed Oct  9 17:15:32 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:15:32 -0300
Subject: jemboss
Message-ID: <000801c26fd9$9e235fe0$6500a8c0@cpqrr.fiocruz.br>

I would like to use jemboss (the Java interface for EMBOSS) but I need to enroll in HGMP and I don't know how to do this. Could you send me the email address I can use to do this?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

 

From tcarver at hgmp.mrc.ac.uk  Mon Oct 28 13:15:35 2002
From: tcarver at hgmp.mrc.ac.uk (Dr T. Carver)
Date: Mon, 28 Oct 2002 18:15:35 +0000 (GMT)
Subject: jemboss
In-Reply-To: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>
Message-ID: <Pine.SOL.4.44.0210281813060.1860-100000@bromine>

Hi

You can register at the HGMP by filling out the form at:
http://www.hgmp.mrc.ac.uk/About/Registration/

Then send it to:
UK MRC HGMP Resource Centre
Hinxton
Cambridge
CB10 1SB
UK

You will then be sent an HGMP username and password.

Regards
Tim Carver

On Wed, 9 Oct 2002, Prof. Giovanni Gazzinelli wrote:

> I would like to use the jemboss program but I need to enroll in HGMP and I don't know how to do this.
> Could you help me?
> Thanks,
> Solange Busek
> Centro de Pesquisas René Rachou/FIOCRUZ


From David.Lapointe at umassmed.edu  Mon Oct 28 17:21:55 2002
From: David.Lapointe at umassmed.edu (Lapointe, David)
Date: Mon, 28 Oct 2002 17:21:55 -0500
Subject: Emboss on Solaris.
Message-ID: <13B2F22F9D5DD611B07700508BB1E88F019A2D7A@edunivexch02.umassmed.edu>

We've moved to a Netra T1 and I am having problems with the PNG libraries. I
get these runtime errors
using png as output (postscript/X11 work fine). The png.h is 1.2.4. What am
I missing?


$ prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: opsin.msf
Graph type [x11]: png
libpng warning: Application was compiled with png.h from libpng-1.0.6
libpng warning: Application  is  running with png.c from libpng-1.2.4
gd-png:  fatal libpng error: Incompatible libpng version in application and
library

David Lapointe
Senior Informaticist / Information Services
Assistant Professor / Cell Biology
UMass Worcester
(508) 856-5141


From David.Bauer at SCHERING.DE  Tue Oct 29 01:37:00 2002
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 29 Oct 2002 07:37:00 +0100
Subject: Antwort: Emboss on Solaris.
Message-ID: <OF42A961B2.334D459D-ONC1256C61.0023C200@schering.de>


Hi,

I also had some problems with this on Solaris.
Did you try to run configure with "--with-pngdriver=DIR"?
This helps EMBOSS to pick the right header files.

David.


David Lapointe wrote:
> We've moved to a Netra T1 and I am having problems with the PNG libraries. I
> get these runtime errors
> using png as output (postscript/X11 work fine). The png.h is 1.2.4. What am
> I missing?
>
>
> $ prettyplot
> Displays aligned sequences, with colouring and boxing
> Input sequence set: opsin.msf
> Graph type [x11]: png
> libpng warning: Application was compiled with png.h from libpng-1.0.6
> libpng warning: Application  is  running with png.c from libpng-1.2.4
> gd-png:  fatal libpng error: Incompatible libpng version in application and
> library
>
> David Lapointe
> Senior Informaticist / Information Services
> Assistant Professor / Cell Biology
> UMass Worcester
> (508) 856-5141


From shibl at seqbio.com  Wed Oct 30 11:13:08 2002
From: shibl at seqbio.com (Shibl Mourad)
Date: Wed, 30 Oct 2002 11:13:08 -0500
Subject: Emboss Expert System
Message-ID: <002c01c2802f$3fec6370$2602a8c0@SEQUENCE>

Dear EMBOSS user,

We are currently developing an expert system that will complement EMBOSS.
As there are roughly 200 tools packaged within EMBOSS alone, the task of
locating the 'right' tool, especially if you are a newcomer to the
bioinformatics field, can be overwhelming.

Our expert system, openExpert, aims to simulate the 'question and answer'
conversation one would have with a bioinformatics 'expert' -  but minus
their presence and wage.  Although it is currently populated with only the
EMBOSS suite, we aim to broaden the knowledge base of openExpert to
encompass all known bioinformatics tools.

We are looking for 5 EMBOSS users to review the system.  The review should
not take more than 30 minutes of your time and it would be of great value to
us.  If you are interested, please email shibl at seqbio.com.  If you would
like to try openExpert without providing a review, please indicate so in
your email and we will provide you with free access.

Help us make openExpert a valuable expert system for bioinformatics.


Thank you,

Shibl Mourad,
President
Sequence Bioinformatics


From newgene at bigfoot.com  Thu Oct 31 12:43:06 2002
From: newgene at bigfoot.com (clwu)
Date: Thu, 31 Oct 2002 11:43:06 -0600
Subject: emboss in cygwin
Message-ID: <3DC16BAA.1050201@bigfoot.com>

Hi, group,
           I am new to the group. I tried to compile EMBOSS under 
win2K/cygwin but I failed. The EMBOSS website at HGMP mentions that
"Richard Bruskiewich and Simon Kelley at the Sanger Centre have 
succeeded in compiling EMBOSS under Windows NT using the CygWin package. 
The resulting executables have been tested but not thoroughly enough for 
a release. Contact Richard Bruskiewich for more information. ". But I 
cannot follow the link on that page to get help.
          Does anyone have successful experience with this? Are there 
pre-compiled executables for cygwin available, even for only some of the 
standalone programs? That would help me a lot.

Thank you in advance.


clwu


From squiresb at macrogenics.com  Fri Oct  4 17:48:47 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Fri, 04 Oct 2002 12:48:47 -0500
Subject: Primer prediction problems...
Message-ID: <B9C33EAF.327B%squiresb@macrogenics.com>

Hello all,

I am trying to use EMBOSS to predict PCR primers. I have tried downloading
the Catapult installers for Mac OS X as well as downloading the V2.5.1 tar
file and the primer3.0.9 tar and installing them. I get errors about a
broken pipe or no primer3_core file found?

Can I trouble someone to point out an install document on a website that
lists a current set of instructions on installing EMBOSS and primer3 (or
another primer prediction program)?

Thanks in advance!

Burke Squires


From gwilliam at hgmp.mrc.ac.uk  Mon Oct  7 08:17:32 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Mon, 07 Oct 2002 09:17:32 +0100
Subject: Primer prediction problems...
References: <B9C33EAF.327B%squiresb@macrogenics.com>
Message-ID: <3DA1431C.A139446B@hgmp.mrc.ac.uk>

The primer3_core program needs to be on your path before you can run
eprimer3.

Gary

Burke Squires wrote:
> 
> Hello all,
> 
> I am trying to use EMBOSS to predict PCR primers. I have tried downloading
> the Catapult installers for Mac OS X as well as downloading the V2.5.1 tar
> file and the primer3.0.9 tar and installing them. I get errors about a
> broken pipe or no primer3_core file found?
> 
> Can I trouble someone to point out an install document on a website that
> lists a current set of instructions on installing EMBOSS and primer3 (or
> another primer prediction program)?
> 
> Thanks in advance!
> 
> Burke Squires

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From avc at sanger.ac.uk  Mon Oct  7 10:50:40 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Mon, 07 Oct 2002 11:50:40 +0100
Subject: fasta splitter
Message-ID: <3DA16700.90280280@sanger.ac.uk>


Is there an emboss app to split a large fasta file into a set of smaller ones?
I'm combing the docs but can't see anything - it may be staring me in the
face...

thanks

Tony

-- 
##############################################################
Email: avc at sanger.ac.uk         # Webmaster,The Sanger Centre,
Tel: 01223 497512               # Hinxton, CAMBRIDGE CB10 1SA.
Fax: 01223 494919               # http://www.sanger.ac.uk/
##############################################################


From Thomas.Laurent at uk.lionbioscience.com  Mon Oct  7 11:02:02 2002
From: Thomas.Laurent at uk.lionbioscience.com (Thomas Laurent)
Date: Mon, 07 Oct 2002 12:02:02 +0100
Subject: fasta splitter
References: <3DA16700.90280280@sanger.ac.uk>
Message-ID: <3DA169AA.1040409@uk.lionbioscience.com>

Hi tony,
I think Splitter should do the job :
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html

Cheers,
Thomas

Tony Cox wrote:
> Is there an emboss app to split a large fasta file into a set of smaller ones?
> I'm combing the docs but can't see anything - it may be staring me in the
> face...
> 
> thanks
> 
> Tony
> 


From avc at sanger.ac.uk  Mon Oct  7 11:16:47 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Mon, 7 Oct 2002 12:16:47 +0100 (BST)
Subject: fasta splitter
In-Reply-To: <3DA169AA.1040409@uk.lionbioscience.com>
Message-ID: <Pine.OSF.4.44.0210071216180.879493-100000@cbi1a>

On Mon, 7 Oct 2002, Thomas Laurent wrote:

+>Hi tony,
+>I think Splitter should do the job :
+>http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/splitter.html

almost, but not quite. This converts one file to many files each containing one
sequence. I need something like a conversion of one file containing 1000 seqs to
10 files each containing 100 seqs

Tony


+>
+>Cheers,
+>Thomas
+>
+>Tony Cox wrote:
+>> Is there an emboss app to split a large fasta file into a set of smaller ones?
+>> I'm combing the docs but can't see anything - it may be staring me in the
+>> face...
+>>
+>> thanks
+>>
+>> Tony
+>>
+>
+>

******************************************************
Tony Cox			Email:avc at sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************


From jweiner1 at ix.urz.uni-heidelberg.de  Mon Oct  7 11:47:33 2002
From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3)
Date: Mon, 7 Oct 2002 13:47:33 +0200 (METDST)
Subject: fasta splitter
In-Reply-To: <Pine.OSF.4.44.0210071216180.879493-100000@cbi1a>
Message-ID: <Pine.A41.4.42.0210071340490.34202-101000@aixterm1.urz.uni-heidelberg.de>

Hello,

> almost, but not quite. This converts one file to many files each containing one
> sequence. I need something like a conversion of one file containing 1000
> seqs to 10 files each containing 100 seqs

I wrote you a simple perl script which should do the job.  Save it to a
file and make it executable (I think you are using a Unix-based system,
aren't you?) with chmod a+x split.pl.  To be on the safe side, put it in a
new directory, and copy your sequence file to the same directory.  Now run

./split.pl <filename> <number of sequences>

...where <filename> is the name of the file containing your 1000+ sequences,
and <number of sequences> is the number of sequences you wish to have in
each produced file.  The produced files will have the same name as the
original file with the suffix .1, .2, .3 etc.
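
For readers who cannot fetch the attachment, the behaviour described above can be sketched in Python as follows. This is not the attached split.pl, only an illustration of the same idea; the function name split_fasta_file is hypothetical, and it assumes every sequence starts with a ">" header line.

```python
def split_fasta_file(filename, per_file):
    """Split a FASTA file into pieces holding `per_file` sequences each.

    The pieces are named filename.1, filename.2, ... and the list of
    names is returned.
    """
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    written = []
    out = None
    count = 0
    with open(filename) as source:
        for line in source:
            if line.startswith(">"):
                if count % per_file == 0:
                    # Current piece is full (or none is open yet): start the next one.
                    if out is not None:
                        out.close()
                    name = "%s.%d" % (filename, len(written) + 1)
                    out = open(name, "w")
                    written.append(name)
                count += 1
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

For example, split_fasta_file("seqs.fasta", 100) on a file of 1000 sequences would write seqs.fasta.1 to seqs.fasta.10 with 100 sequences each.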

I tried the script and it seems to work fine.  Meet the power of Perl :-)

Regards,
j.

----)-\//-///-----------------------------------January-Weiner-3-------
Technologists often forget the general user. Technology is only as good as
the user experience. That is something that technology groups very often
forget... [ Linus Torvalds, taken from the GNOME Usability Project ]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: split.pl
Type: application/x-perl
Size: 849 bytes
Desc: 
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20021007/304bd4fb/attachment.pl>

From areagp61 at yahoo.it  Mon Oct  7 12:49:35 2002
From: areagp61 at yahoo.it (Graziano P.)
Date: Mon, 7 Oct 2002 14:49:35 +0200
Subject: Codon usage files
Message-ID: <000b01c26e00$03ee27f0$18105709@italy.ibm.com>

Hi all,
with backtranseq I can use different codon usage tables by selecting different
"codon usage files" in the EMBOSS data path. Some file names are
self-explanatory (for example, Ehuman.cut is the codon usage file for
Homo sapiens), but others are not, like Eacc.cut, Esma.cut, Eddi.cut, etc.
Is there any document that gives information about every file?

Thanks
Graziano Pappad?




From md0nilhe at mdstud.chalmers.se  Mon Oct  7 13:10:33 2002
From: md0nilhe at mdstud.chalmers.se (Henrik Nilsson)
Date: Mon, 7 Oct 2002 15:10:33 +0200 (MET DST)
Subject: EMBASSY problem
Message-ID: <Pine.SOL.4.30.0210071508080.21060-100000@grosse.mdstud.chalmers.se>


Hello

I'm having major problems with compiling the PHYLIP package of EMBASSY.
Would anyone happen to have compiled it successfully on RedHat 7.3, and
would be willing to send me the executables?

hENRiK

--

          Written using

        VIM - Vi IMproved

           version 5.0

        http://www.vim.org


From ableasby at hgmp.mrc.ac.uk  Mon Oct  7 13:14:43 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 7 Oct 2002 14:14:43 +0100 (BST)
Subject: Codon usage files
Message-ID: <200210071314.OAA29103@bromine.hgmp.mrc.ac.uk>

Not every file but most are described in the README file
from ftp://ftp.ebi.ac.uk/pub/databases/codonusage

You can use the EMBOSS program 'cutgextract' on the CUTG
database to get files with more meaningful (long) names.


Alan


From mathog at mendel.bio.caltech.edu  Mon Oct  7 14:49:05 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Mon, 07 Oct 2002 07:49:05 -0700
Subject: fasta splitter
Message-ID: <E17yZBx-0003ii-00@mendel.bio.caltech.edu>


> 
> Is there an emboss app to split a large fasta file into a set of
smaller ones?
> I'm combing the docs but can't see anything - it may be staring me in the
> face...

This isn't an EMBOSS entry, but it will probably do what you want:

  ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

There are some other fasta related utilities in the same directory.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From tmargus at ebc.ee  Mon Oct  7 17:54:12 2002
From: tmargus at ebc.ee (=?iso-8859-1?Q?T=F5nu_Margus?=)
Date: Mon, 7 Oct 2002 20:54:12 +0300
Subject: WWW  - Emma is not able to create SOME   temporary files
Message-ID: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>

Hi,

I am using EMBOSS via Luke McCarthy's web interface. All other programs are working, 
but emma did not work correctly.

It gives an error:

Error: failed to open filename 8808B Problem writing out EMBOSS alignment fileError: failed to open filename 8808B Problem writing out EMBOSS alignment file

It seems that for some reason it cannot create a file under the runs/temp directory.
Why not is unclear to me.  All other files are there.

Files in the directory runs/fileVxWbES$/:

root at kobra:fileVxWbES$ ls -l
total 16
-rw-r--r--   1 www      java          915 Oct  7 20:51 8825A
-rw-r--r--   1 www      java            0 Oct  7 20:51 dendoutfile
-rw-r--r--   1 www      java          384 Oct  7 20:51 error
-rw-r--r--   1 www      java         2145 Oct  7 20:51 index.html
drwxr-xr-x   2 www      java         4096 Oct  7 20:51 input
-rw-r--r--   1 www      java            0 Oct  7 20:51 outseq

Command line clustalw works  ok
 
Is there a solution for this problem?


Tõnu Margus


From starksb at ebi.ac.uk  Mon Oct  7 18:58:15 2002
From: starksb at ebi.ac.uk (David Starks-Browning)
Date: Mon,  7 Oct 2002 19:58:15 +0100
Subject: WWW  - Emma is not able to create SOME   temporary files
In-Reply-To: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
References: <009f01c26e2a$8baa70c0$1e1728c1@ebc.ee>
Message-ID: <5473-Mon07Oct2002195815+0100-starksb@ebi.ac.uk>

On Monday 7 Oct 02, Tõnu Margus writes:
> Hi,
> 
> I am using EMBOSS via Luke McCarthy's web interface. All other programs are working, 
> but emma did not work correctly.
> 
> It gives an error:
> 
> Error: failed to open filename 8808B Problem writing out EMBOSS alignment fileError: failed to open filename 8808B Problem writing out EMBOSS alignment file
> 
> It seems that for some reason it cannot create a file under the runs/temp directory.
> Why not is unclear to me.  All other files are there.
> 
> Files in the directory runs/fileVxWbES$/:
> 
> root at kobra:fileVxWbES$ ls -l
> total 16
> -rw-r--r--   1 www      java          915 Oct  7 20:51 8825A
> -rw-r--r--   1 www      java            0 Oct  7 20:51 dendoutfile
> -rw-r--r--   1 www      java          384 Oct  7 20:51 error
> -rw-r--r--   1 www      java         2145 Oct  7 20:51 index.html
> drwxr-xr-x   2 www      java         4096 Oct  7 20:51 input
> -rw-r--r--   1 www      java            0 Oct  7 20:51 outseq
> 
> Command line clustalw works  ok

You don't show the permissions of the directory itself (use 'ls -la').
It's the directory permissions that determine whether files can be
created.
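
The directory check described above can also be scripted. What follows is
only a minimal sketch, assuming Python is available on the server; the
helper names can_create_files and describe_directory are made up for
illustration. The answer applies to the user running the script, so it
would need to run as the web server user ("www" in the listing) to say
anything about the web interface.

```python
import os
import stat


def can_create_files(directory):
    """Return True if the current user may create files in `directory`.

    Creating a file needs write and search (execute) permission on the
    directory itself; the permissions of files already in it do not matter.
    """
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


def describe_directory(directory):
    """Return the mode string that 'ls -ld' would show, e.g. 'drwxr-xr-x'."""
    return stat.filemode(os.stat(directory).st_mode)
```

For example, can_create_files("runs/fileVxWbES$") would report whether the
run directory from the listing accepts new files for the current user.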

However, this may not be the problem.  We have seen problems with emma
on Linux, because the underlying application, clustalw, cannot deal
with filenames that are 5 characters long on Linux.  String buffer
management bugs in emma cause it to emit garbage characters after the
filename to the open() system call.  With emma, you will see this when
emma's PID is 4 digits long.  (You won't see the garbage characters in
error messages.  You only see them under strace.)

Clustalw should be fixed.  If that won't happen, emma.c could be
modified to pad the temporary file name with enough extra characters
so that, regardless of Linux PID, emma will use temp filenames longer
than 5 characters.
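
The padding idea can be sketched as follows. This is not the emma.c patch
(which would be written in C); it is an illustration in Python under the
assumption, inferred from names such as "8808B" and "8825A" in the report,
that a temporary name is the PID followed by one letter. The function name
and the default width are hypothetical.

```python
import os


def padded_temp_name(pid=None, suffix="A", width=8):
    """Build a temp file base name from a PID, zero-padded to `width` digits.

    With width=8 the name is always at least 9 characters long, so it can
    never be exactly 5 characters whatever the PID is.
    """
    if pid is None:
        pid = os.getpid()
    # format(..., "08d") pads on the left with zeros up to the given width.
    return format(pid, "0%dd" % width) + suffix
```

With these defaults, PID 8808 and suffix "B" give "00008808B" instead of
"8808B".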

I don't have a patch for the latest version of emma, because I applied
the workaround to an old (1.9.1) version of EMBOSS.  Emma.c has changed a bit
since then, although the change is still straightforward to apply.

If you think this is your problem, I can provide details on how to
modify emma.c.

Hope this helps.

Kind regards,
David

 -------------------------------------------------------------------
  David Starks-Browning                  | starksb at ebi.ac.uk
  EMBL Outstation --                     |
  The European Bioinformatics Institute  |
  Wellcome Trust Genome Campus           | tel: +44 (1223) 494 616
  Hinxton, Cambridge, CB10 1SD, UK       | fax: +44 (1223) 494 468
 -------------------------------------------------------------------


From tcarver at hgmp.mrc.ac.uk  Tue Oct  8 08:44:34 2002
From: tcarver at hgmp.mrc.ac.uk (Tim Carver)
Date: Tue, 08 Oct 2002 09:44:34 +0100
Subject: Jemboss Server Feedback
Message-ID: <3DA29AF2.BE48E5F5@hgmp.mrc.ac.uk>


It would be immensely useful if those who have set up a Jemboss server could
provide some feedback to us. This is useful in providing some ideas for the
future direction of its development and to give our funding body some idea
of its usage at other sites. In particular the following information would
be of use:


1. Nationality
2. Funding body and/or Organisation
3. Server Platform  O/S (linux, solaris, MacOSX, AIX, HP-UX....)
4. Type of installation - e.g. with unix authorisation
5. Number of users at your site using Jemboss
6. Comments
     - what were you using before & why you changed
     - likes, dislikes & suggestions for Jemboss development (server & client)


Many thanks in advance,
Tim Carver

HGMP-RC


From mq1 at sanger.ac.uk  Tue Oct  8 13:00:06 2002
From: mq1 at sanger.ac.uk (Mike Quail)
Date: Tue, 8 Oct 2002 14:00:06 +0100
Subject: restriction mapping
Message-ID: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk>

Hi

I am currently looking to isolate restriction fragments that cover gaps that are left in several genomes. To do this I need to cut the sequence we have of a genome with all known database enzymes and then select those that just cut a few times and in the right place so as to excise the region of the genome I require. 

GCG programs map and mapplot were excellent for doing this. Map in particular is good as it gives a graphical plot for each enzyme (one enzyme per line) plotting all the enzymes on a page or two so you can rapidly see which is appropriate.

I have tried the EMBOSS programs and basically they are no use. REMAP does what I want but in too great detail (the output would stretch round the globe) and RESTRICT is too unordered in its output. 

I have got a program called oligo on my PC that will do this, BUT it has problems with big sequences. Recently I tried analysing a 1.5Mb chromosome and it would only work if I limited the number of enzymes to 6 or less. So I could transfer the data over to my PC and try with that but as this organism is 5Mb it will be very slow going.

Have you any ideas of how this could be done in EMBOSS?

M.Quail


Project Leader

Wellcome Trust Sanger Institute


From peter.rice at uk.lionbioscience.com  Tue Oct  8 13:30:31 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 14:30:31 +0100
Subject: restriction mapping
References: <000d01c26eca$a16bc940$6d1019ac@internal.sanger.ac.uk>
Message-ID: <3DA2DDF7.70604@uk.lionbioscience.com>

Mike Quail wrote:
> I am currently looking to isolate restriction fragments that cover gaps 
> that are left in several genomes. To do this I need to cut the sequence 
> we have of a genome with all known database enzymes and then select 
> those that just cut a few times and in the right place so as to excise 
> the region of the genome I require.
> 
> Have you any ideas of how this could be done in EMBOSS.

You just need to know the enzymes that only cut twice, for example?

% restrict -min 2 -max 2 -plasmid

(the -plasmid may look odd, but it means "circular DNA" and says nothing 
about the size :-)

You can also check each enzyme one at a time afterwards:

% restrict -plasmid -fragment -enzyme BssHI

... the -fragment option includes the fragment sizes at the end of the 
report. You will need the positions and the fragment sizes to choose an enzyme.

You can select other report formats (-rformat), but the default is probably 
the most useful for your case (-rformat EMBL or GFF, for example, will miss 
the -fragment output)
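
To make the selection logic concrete, here is a small sketch of the same
idea outside EMBOSS: count the sites of each enzyme on a circular sequence,
keep the enzymes that cut a given number of times, and list the fragment
sizes. It is not how restrict is implemented. It only searches the given
strand (enough for palindromic sites), ignores ambiguity codes, and reports
the start of the recognition site rather than the true cut position. All
function names are made up for illustration.

```python
def cut_positions(sequence, site):
    """Return 1-based start positions of `site` in a circular sequence."""
    seq = sequence.upper()
    site = site.upper()
    # Append the first len(site)-1 bases so sites spanning the origin are found.
    extended = seq + seq[: len(site) - 1]
    return [i + 1 for i in range(len(seq)) if extended.startswith(site, i)]


def circular_fragments(positions, length):
    """Return fragment sizes between consecutive cuts on a circle."""
    if not positions:
        return []
    ordered = sorted(positions)
    sizes = [b - a for a, b in zip(ordered, ordered[1:])]
    # The last fragment wraps around the origin back to the first cut.
    sizes.append(length - ordered[-1] + ordered[0])
    return sizes


def enzymes_cutting(sequence, enzymes, min_cuts=2, max_cuts=2):
    """Select enzymes whose number of sites lies in [min_cuts, max_cuts]."""
    selected = {}
    for name, site in enzymes.items():
        positions = cut_positions(sequence, site)
        if min_cuts <= len(positions) <= max_cuts:
            selected[name] = positions
    return selected
```

Called with a dictionary such as {"EcoRI": "GAATTC", "BamHI": "GGATCC"},
enzymes_cutting returns only the enzymes with exactly two sites, and
circular_fragments gives the sizes needed to choose between them.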

Meanwhile, a graphical view could be nice so you can look for restriction 
sites on screen.
We can look into that.

Hope this helps,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From jonas.andersson at rocketmail.com  Tue Oct  8 14:35:50 2002
From: jonas.andersson at rocketmail.com (Jonas Andersson)
Date: Tue, 8 Oct 2002 07:35:50 -0700 (PDT)
Subject: Not compiling?
Message-ID: <20021008143550.40367.qmail@web40110.mail.yahoo.com>

When I try to compile the latest EMBOSS this is what I get. What am I
doing wrong, given that I am doing as suggested on the EMBOSS pages?


-MT ajreport.lo -MD -MP -MF .deps/ajreport.TPlo -o ajreport.o
>/dev/null 2>&1
make[1]: *** [ajreport.lo] Error 1
make[1]: Leaving directory `/home/henrik/temp/emboss/EMBOSS-2.5.1/ajax'
make: *** [all-recursive] Error 1

/ Jonas



From avc at sanger.ac.uk  Tue Oct  8 15:10:22 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Tue, 8 Oct 2002 16:10:22 +0100 (BST)
Subject: fasta splitter 
In-Reply-To: <Pine.A41.4.42.0210081639270.31834-100000@aixterm1.urz.uni-heidelberg.de>
Message-ID: <Pine.OSF.4.44.0210081607100.919344-100000@cbi1a>

On Tue, 8 Oct 2002, January Weiner 3 wrote:

Thanks to all that responded. I did, in the end write a 12 line bioperl script
to split my fasta file. My request seems, however, to highlight a small blind
spot on the EMBOSS radar. It appears that there are a number of implementations
out there - perhaps one of them can be donated to the emboss project as the
basis of a new software tool?

Tony

+>Hi,
+>
+>> This is apparently something that is frequently asked by biologists.
+>> If you call it fastasplitter, I have a Web interface ready for it:
+>> http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html
+>> If you think it's interesting, I install it, and in such case, I will
+>> put your name (J. Weiner ?) on the Web interface.
+>
+>No problem, do it, it's freeware (not even GPL :-).  However, if you think
+>that such a tool is useful, then I'll rewrite it in C -- to make it faster.
+>If I may suggest -- it'd be nice if you could download or get the produced
+>files as a tgz or zip archive.
+>
+>j.
+>
+>----)-\//-///-----------------------------------January-Weiner-3-------
+>Wyszła Hożą i Czystą, wróciła Wspólną i Niecałą [ (C) by moja babcia ]
+>
+>

******************************************************
Tony Cox			Email:avc at sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************


From Joerg.Schaber at uv.es  Tue Oct  8 15:58:11 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Tue, 08 Oct 2002 17:58:11 +0200
Subject: loading DDBJ data into EMBOSS
Message-ID: <3DA30093.6080404@uv.es>

Hi,

I have problems creating an EMBOSS database from a DDBJ flatfile (e.g. 
ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using 
'dbiflat -idformat gb'. I get a warning for all entries in the flatfile:
'Warning: Duplicate ID skipped: '<null>' All hits will point to first ID 
found' and I cannot retrieve any sequence. I think dbiflat only 
recognizes the first entry.
When I download the corresponding fasta flatfile I have no problems 
creating an EMBOSS database using 'dbifasta'. However, I would like to 
use the original DDBJ flatfile because it includes more information.
Any idea what's the problem?

greetings,

joerg


From peter.rice at uk.lionbioscience.com  Tue Oct  8 16:08:47 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:08:47 +0100
Subject: loading DDBJ data into EMBOSS
References: <3DA30093.6080404@uv.es>
Message-ID: <3DA3030F.2030808@uk.lionbioscience.com>

Joerg Schaber wrote:
> Hi,
> 
> i have problems creating an EMBOSS database from a DDBJ flatfile (e.g. 
> ftp://ftp.genome.ad.jp/pub/kegg/genomes/genes/Buchnera.ent) using 
> 'dbiflat -idformat gb'. I get a warning for all entries in the flatfile
> 'Warning: Duplicate ID skipped: '<null>' All hits will point to first ID 
> found' and I can not retrieve any sequence. I think dbiflat only 
> recognizes the first entry.
> When I download the corresponding fasta flatfile I have no problems 
> creating an EMBOSS database using 'dbifasta'. However, I would like to 
> use the original DDBJ flatfile because it includes more information.
> Any idea what's the problem?

Yes ... that file is not in Genbank or DDBJ format!!!!

It looks more like a CODATA format, but only the ENTRY is recognized.
If you can find a name for it, we could probably implement a new 
input/output sequence format ... but it has some horrible features that 
will not be general.

Example entry:

ENTRY       BU002             CDS       Buchnera
NAME        atpB
DEFINITION  ATP synthase A chain [EC:3.6.3.14] [SP:ATP6_BUCAI]
CLASS       Metabolism; Energy Metabolism; Oxidative phosphorylation
             [PATH:buc00190]
             Metabolism; Energy Metabolism; ATP synthesis [PATH:buc00193]
             Metabolism; Energy Metabolism; Photosynthesis [PATH:buc00195]
POSITION    2278..3102
DBLINKS     RIKEN: BU002
             NCBI: 10038695
CODON_USAGE       T               C               A               G
           T  27   2  22   7  11   0   7   1   7   1   1   0   1   0   0   5
           C   4   0   3   2   6   1   4   2   5   1   8   2   1   0   2   0
           A  28   0   5  12   5   0   3   0   7   3  13   1   4   1   0   0
           G   4   1  12   3   5   1   5   0   8   0   7   1   7   2   4   0
AASEQ       274
             MILEKISDPQKYISHHLSHLQIDLRSFKIIQPGALSSDYWTVNVDSMFFSLVLGSFFLSI
             FYMVGKKITQGIPGKLQTAIELIFEFVNLNVKSMYQGKNALIAPLSLTVFIWVFLMNLMD
             LVPIDFFPFISEKVFELPAMRIVPSADINITLSMSLGVFFLILFYTVKIKGYVGFLKELI
             LQPFNHPVFSIFNFILEFVSLVSKPISLGLRLFGNMYAGEMIFILIAGLLPWWTQCFLNV
             PWAIFHILIISLQAFIFMVLTIVYLSMASQSHKD
NTSEQ       825
             atgattttagaaaagatatctgatcctcaaaaatatattagtcatcatttaagtcacttg
             cagatagatttgcgttcttttaaaattattcaaccaggtgcattgtcttctgattattgg
             actgtaaatgttgattcaatgtttttttctcttgtactgggtagtttttttttaagtatt
             ttttatatggtaggaaaaaaaattactcaaggtataccaggtaaattacaaactgcaatt
             gagttaatttttgaatttgtaaatttaaatgtaaaaagcatgtatcaaggtaaaaatgct
             cttattgcacctttatcattaacagtatttatttgggtttttttaatgaatctaatggat
             ttagttccgattgatttctttccatttatttctgaaaaagtgtttgaattacctgctatg
             cgaattgtaccttctgctgatattaatattacactatcaatgtcacttggcgtgtttttt
             ttaattttattttatactgttaaaattaaaggatatgtaggctttttaaaagaacttatt
             ttacaacctttcaaccatcctgtattttctatttttaattttatattagaatttgtgtca
             ttggtctcgaaacccatttctttgggattgcgattatttggaaacatgtacgcaggtgaa
             atgatttttattttaattgcaggtttgctgccatggtggacacaatgttttttaaacgta
             ccgtgggctatttttcatattttaataatttcactacaggcttttatttttatggtatta
             actattgtatatttatcaatggcctctcaatctcataaagattaa
///
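
Until such a format is supported, one workaround is to convert the file to
FASTA before indexing it with dbifasta. The sketch below is an
illustration only: the field layout (keyword in column 1, indented
continuation lines, "///" as terminator) is inferred from the single
example entry above and may not hold for every record, and the function
names are hypothetical.

```python
def parse_entries(lines):
    """Yield (entry_id, nucleotide_sequence) from records ending in '///'."""
    entry_id = None
    field = None
    seq_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("///"):
            if entry_id is not None:
                yield entry_id, "".join(seq_parts)
            entry_id, field, seq_parts = None, None, []
            continue
        if line and not line[0].isspace():
            # A new field starts in column 1; continuation lines are indented.
            parts = line.split()
            field = parts[0]
            if field == "ENTRY" and len(parts) > 1:
                entry_id = parts[1]
            # The NTSEQ line itself carries only the length, not sequence.
        elif field == "NTSEQ":
            seq_parts.append(line.strip())


def to_fasta(entry_id, sequence, width=60):
    """Format one sequence as a FASTA record with `width` bases per line."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + entry_id + "\n" + "\n".join(rows) + "\n"
```

Only the ENTRY identifier and the NTSEQ block are kept here, so the extra
annotation in the original flatfile is lost in the conversion.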


-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From peter.rice at uk.lionbioscience.com  Tue Oct  8 16:37:36 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:37:36 +0100
Subject: fasta splitter
References: <Pine.OSF.4.44.0210081607100.919344-100000@cbi1a>
Message-ID: <3DA309D0.7@uk.lionbioscience.com>

Tony Cox wrote:
> On Tue, 8 Oct 2002, January Weiner 3 wrote:
> 
> Thanks to all that responded. I did, in the end write a 12 line bioperl script
> to split my fasta file. My request seems, however, to highlight a small blind
> spot on the EMBOSS radar. It appears that there are a number of implementations
> out there - perhaps one of them can be donated to the emboss project as the
> basis of a new software tool?

Nobody suggested hacking "seqret" to do what you want...

One problem doing this in EMBOSS is the need to generate filenames for your 
  split files - but maybe a base filename would be enough to generate 
names. Then all you need to do is count sequences in a modified seqret.c 
and change the output file. You can add a command line option for the 
number of sequences in an output file. Cleaning up output files for a rerun 
is an exercise for the user (unless you want to invent a new ACD type that 
does it :-)

Needs a modified version of the seqFileReopen function to handle the file 
naming, but nothing complicated is involved.

regards.

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From avc at sanger.ac.uk  Tue Oct  8 16:39:31 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Tue, 8 Oct 2002 17:39:31 +0100 (BST)
Subject: fasta splitter
In-Reply-To: <3DA309D0.7@uk.lionbioscience.com>
Message-ID: <Pine.OSF.4.44.0210081738450.919344-100000@cbi1a>

On Tue, 8 Oct 2002, Peter Rice wrote:

that sounds excellent - does this mean it really will make it in to the EMBOSS
release? (any idea when? ;)

Tony


+>Tony Cox wrote:
+>> On Tue, 8 Oct 2002, January Weiner 3 wrote:
+>>
+>> Thanks to all that responded. I did, in the end write a 12 line bioperl script
+>> to split my fasta file. My request seems, however, to highlight a small blind
+>> spot on the EMBOSS radar. It appears that there are a number of implementations
+>> out there - perhaps one of them can be donated to the emboss project as the
+>> basis of a new software tool?
+>
+>Nobody suggested hacking "seqret" to do what you want...
+>
+>One problem doing this in EMBOSS is the need to generate filenames for your
+>  split files - but maybe a base filename would be enough to generate
+>names. Then all you need to do is count sequences in a modified seqret.c
+>and change the output file. You can add a command line option for the
+>number of sequences in an output file. Cleaning up output files for a rerun
+>is an exercise for the user (unless you want to invent a new ACD type that
+>does it :-)
+>
+>Needs a modified version of the seqFileReopen function to handle the file
+>naming, but nothing complicated is involved.
+>
+>regards.
+>
+>Peter
+>
+>--
+>------------------------------------------------
+>Peter Rice, LION Bioscience Ltd, Cambridge, UK
+>peter.rice at uk.lionbioscience.com +44 1223 224723
+>

******************************************************
Tony Cox			Email:avc at sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************


From peter.rice at uk.lionbioscience.com  Tue Oct  8 16:42:39 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 08 Oct 2002 17:42:39 +0100
Subject: fasta splitter
References: <Pine.OSF.4.44.0210081738450.919344-100000@cbi1a>
Message-ID: <3DA30AFF.3010100@uk.lionbioscience.com>

Hi Tony

> that sounds excellent - does this mean it really will make it in to the EMBOSS
> release? (any idea when? ;)

I already have the first part of the code ... a modified "seqret" to split 
into 10 sequences per file.

Working copy is called "tenco" :-)

What did you have in mind as a naming convention for the output files? The 
existing code names each file after the first sequence, I guess you want 
"outfile.1" "outfile.2" and so on, possibly with leading zeroes 
"outfile.001" etc.

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From letondal at pasteur.fr  Tue Oct  8 18:11:11 2002
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 08 Oct 2002 20:11:11 +0200
Subject: fasta splitter 
In-Reply-To: Your message of "Mon, 07 Oct 2002 13:47:33 +0200."
             <Pine.A41.4.42.0210071340490.34202-101000@aixterm1.urz.uni-heidelberg.de> 
Message-ID: <200210081811.g98IBBuY253618@electre.pasteur.fr>


January Weiner 3 wrote:
> 
> Hello,
> 
> > almost, but not quite. This converts one file to many files containg one
> > sequence. I need something like a conversion of one file containing 1000
> > seqs to 10 files each  containing 100 seqs
> 
> I wrote you a simple perl script which should do the job.  Save it to a
> file and make it executable (I think you are using a Unix-based system,
> aren't you?) with chmod a+x split.pl.  To be on the safe side, put it in a
> new directory, and copy your sequence file to the same directory.  Now run
> 
> ./split.pl <filename> <number of sequences>
> 
> ...where filename is the name of the file containing your 1000+ sequences,
> and <number of sequences> is the number of sequences you wish to have in
> each produced file.  The produced file will have the same name as the
> original file with the appendix .1, .2, .3 etc.
> 
> I tried the script and it seems to work fine.  Meet the power of Perl :-)

For information, I have installed this script and the program is
available at: http://bioweb.pasteur.fr/seqanal/interfaces/fastasplitter.html
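
For readers without the script at hand, the behaviour described in the
quoted message (a fixed number of sequences per output file, with files
named after the input plus .1, .2, .3 and so on) can be sketched like
this. This is not January Weiner's Perl script, only an assumed
equivalent; the function name is made up.

```python
def split_fasta(path, per_file):
    """Write consecutive groups of `per_file` sequences to path.1, path.2, ...

    Returns the list of file names written.
    """
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    written = []
    out = None
    count = 0
    with open(path) as source:
        for line in source:
            if line.startswith(">"):
                if count % per_file == 0:
                    # This header starts the next output file.
                    if out is not None:
                        out.close()
                    name = "%s.%d" % (path, len(written) + 1)
                    out = open(name, "w")
                    written.append(name)
                count += 1
            # Lines before the first header are dropped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

As with the original advice, it is safest to run this in a fresh directory,
since existing files with the same names are overwritten.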

--
Catherine Letondal -- Pasteur Institute Computing Center


From avc at sanger.ac.uk  Tue Oct  8 18:38:50 2002
From: avc at sanger.ac.uk (Tony Cox)
Date: Tue, 8 Oct 2002 19:38:50 +0100
Subject: fasta splitter
References: <Pine.OSF.4.44.0210081738450.919344-100000@cbi1a> <3DA30AFF.3010100@uk.lionbioscience.com>
Message-ID: <000d01c26ef9$f206f710$0a00a8c0@zeus>


----- Original Message -----
From: "Peter Rice" <peter.rice at uk.lionbioscience.com>
To: "Tony Cox" <avc at sanger.ac.uk>
Cc: "January Weiner 3" <jweiner1 at ix.urz.uni-heidelberg.de>;
<pise at pasteur.fr>; <emboss at embnet.org>
Sent: Tuesday, October 08, 2002 5:42 PM
Subject: Re: fasta splitter


> Hi Tony
>
> > that sounds excellent - does this mean it really will make it in to the
> > EMBOSS release? (any idea when? ;)
>
> I already have the first part of the code ... a modified "seqret" to split
> into 10 sequences per file.
>
> Working copy is called "tenco" :-)
>
> What did you have in mind as a naming convention for the output files? The
> existing code names each file after the first sequence, I guess you want
> "outfile.1" "outfile.2" and so on, possibly with leading zeroes
> "outfile.,001" etc.

Hi Peter,

This sounds great to me. Personally, I'd prefer not to have the leading
zeros - just an incrementing ".[integer]" appended to the filename supplied.
Makes shell manipulation easier.

I guess the ideal would be to be able to supply either a number of chunks to
split the file into or else specify a maximum size (either in bytes or fasta
entries) for each chunk.

cheers

Tony


From mathog at mendel.bio.caltech.edu  Tue Oct  8 19:00:12 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 08 Oct 2002 12:00:12 -0700
Subject: fasta splitter
Message-ID: <E17yzaW-0004O6-00@mendel.bio.caltech.edu>

There's more than one way to split a fasta file...

1.  Split M entries into N files, file 1 receives 1->M/N,
file 2 receives M/N+1->2M/N, etc. Advantages - only one
file needs to be open at a time, simple.  Disadvantage -
the resulting split is typically uneven.  Do this with the
NCBI databases and you'll find that they are heavily weighted
for smaller sequences at the beginning and longer ones at the
end.  If the point of the split is to load balance (this is
what I use it for, with parallel BLAST) some nodes will finish
much earlier than others. Implementation: (deleted, I found
this method not to be generally useful)

1b.  head/tail/segment entries out of a fasta file.  While (1)
caused a lot of problems I've often needed to chop out a specific
part of a fasta file.  Why?  Because some piece of software was
blowing up on the 351,234 entry, but only if preceded by several
thousand other entries. Finding the smallest piece that will trigger the
bug can save hours of run time debugging these sorts of problems. 
Implementation:

   ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c

2.  Split M entries into N files, cycling output to each file.
That is, entry M goes to file M modulo N.  Advantage - resulting
files tend to be more even in size.  Disadvantage - N output files
must be open at once (or you have to cycle through N times, once
per phase); if M is small and the size of each entry large the
resulting files will not generally be balanced.  Example, splitting
the yeast genome, heaven help us when full length human chromosomes
start showing up as single FASTA file entries. Implementation:

  ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

3.  Split P bases in M entries into N files "evenly", fragmenting
sequences if they are too large.  Advantage:  fixes the genome
data problem from (2). Disadvantages:  even more complex than
(2) and "entries" in resulting files do not correspond one to
one with the original. Even with clever naming conventions 
(yeastII_100001_200000) end users will be confused.  Clever
names will be truncated by most software at the worst possible
place resulting in a "hit" on "yeastII_" :-(.  Implementation:
(well, partially, this one translates in all 6 frames, but
it has some of the naming/fragmenting features):

   ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c

4.  Split by content.  I.e., strip all the human sequences out
of nr.  I don't believe there is a general solution because there
is no universally agreed upon FASTA header line format.
Implementation:  SRS or something similar.
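
The difference between methods 1 and 2 is easy to see in a few lines. The
sketch below is not taken from fastasplitn.c or any of the programs listed;
it holds all entries in memory, which the C implementations presumably
avoid, and the function names are made up for illustration.

```python
def read_fasta(lines):
    """Group lines into entries; each entry is a list starting with a '>' line."""
    entries = []
    for line in lines:
        if line.startswith(">"):
            entries.append([line])
        elif entries:
            entries[-1].append(line)
    return entries


def split_contiguous(entries, n_files):
    """Method 1: each file receives one contiguous block of entries."""
    total = len(entries)
    bounds = [total * k // n_files for k in range(n_files + 1)]
    return [entries[bounds[k]:bounds[k + 1]] for k in range(n_files)]


def split_cyclic(entries, n_files):
    """Method 2: entry number i goes to file i modulo n_files."""
    return [entries[k::n_files] for k in range(n_files)]


def bases(chunk):
    """Total sequence length of a chunk, ignoring header lines."""
    return sum(len(line.strip()) for entry in chunk for line in entry[1:])
```

With six entries of lengths 1 to 6, sorted short to long and split two
ways, the contiguous split gives chunks of 6 and 15 bases while the cyclic
split gives 9 and 12, which is the load balancing effect described in (2).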


Regards,


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From squiresb at macrogenics.com  Tue Oct  8 19:20:35 2002
From: squiresb at macrogenics.com (Burke Squires)
Date: Tue, 08 Oct 2002 14:20:35 -0500
Subject: eprimer3...broken pipe?
In-Reply-To: <E17yzaW-0004O6-00@mendel.bio.caltech.edu>
Message-ID: <B9C89A33.33C4%squiresb@macrogenics.com>

I have tried to install various versions of EMBOSS, and when I try to run
eprimer3 I get the following message:

[loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
Picks PCR primers and hybridization oligos
Input sequence(s): /bioinfo/fragments.fa
Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out

   EMBOSS An error in eprimer3.c at line 317:
The program 'primer3_core' must be on the path.
It is part of the 'primer3' package,
available from the Whitehead Institute.
See: http://www-genome.wi.mit.edu/
Broken pipe


Does anybody know how to fix this?
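
The error text says that 'primer3_core' must be on the path, so a first
step is to check whether it can be found at all. A minimal sketch, assuming
Python is installed; the function name and the idea of passing extra
directories are illustrative only, and finding the program does not by
itself guarantee that eprimer3 will then run.

```python
import os
import shutil


def find_program(name, extra_dirs=()):
    """Return the full path of `name` if it can be found, else None.

    Searches the directories in PATH, then any `extra_dirs` given.
    """
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    return shutil.which(name, path=os.pathsep.join(dirs))
```

If find_program("primer3_core") returns None, the directory holding the
primer3 executable would need to be added to PATH before running eprimer3.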

Thanks!

Burke Squires

-- 
Burke Squires
Bioinformatics
MacroGenics, Inc.
2600 Stemmons Freeway, Suite 210
Dallas, TX 75235 USA
Work: 214-634-3000 X224
Squiresb @ macrogenics.com (Please remove spaces to use)
www.macrogenics.com


From tchiang at bioinfo.sickkids.on.ca  Tue Oct  8 19:31:18 2002
From: tchiang at bioinfo.sickkids.on.ca (Ted Chiang)
Date: Tue, 8 Oct 2002 15:31:18 -0400 (EDT)
Subject: cusp
Message-ID: <Pine.GSO.4.05.10210081522430.8600-100000@kenny>


Hi,

I have a question about the EMBOSS program cusp.  The program creates a
codon usage table based on the "coding" sequence of the input file.  My
question is: how does it determine where the 'coding' (or ORF) sequence is,
given any DNA sequence, when one executes the program without specifying
the -sbegin and -send flags?

ie.

$cusp dna_seq  

How does cusp determine where the coding sequence begins?

As opposed to 

$cusp dna_seq -sbegin 135 -send 192

where the coding sequence is specified.  In the latter case, if the length
of the specified region is not divisible by 3, does cusp ignore the last
few nucleotides?


Thanks.

-Ted


=====================================
Ted Chiang, Analyst
Centre for Computational Biology 
Hospital for Sick Children, Toronto
416.813.7028
tchiang at bioinfo.sickkids.on.ca
=====================================


From sebastian.bassi at ar.advantaseeds.com  Tue Oct  8 20:05:43 2002
From: sebastian.bassi at ar.advantaseeds.com (Sebastian Bassi)
Date: Tue, 8 Oct 2002 22:05:43 +0200
Subject: fasta splitter
Message-ID: <BF7C473F341130418CAE514276C099DB06412A@e2knl1.nl.seedsnetwork.com>

> What did you have in mind as a naming convention for the 
> output files? The 
> existing code names each file after the first sequence, I 
> guess you want 
> "outfile.1" "outfile.2" and so on, possibly with leading zeroes 
> "outfile.,001" etc.

My $.02: I think that outfile.[number_here] is not a good convention, since the extension (whatever you put after the dot) means the file type, and here the file type is always the same (ASCII text). I think it should be something like:
outfile_[number].txt
It should look like this:
outfile_1.txt
outfile_2.txt
outfile_3.txt

Anyway, IANAP (I am not a programmer) I'm just an end user and I'm stating this from a user consistency view point. If I have two mp3 files (xsongpart1 and xsongpart2) I would name them as part1.mp3 and part2.mp3 and NOT xsong.1 and xsong.2


From mathog at mendel.bio.caltech.edu  Tue Oct  8 20:33:56 2002
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 08 Oct 2002 13:33:56 -0700
Subject: fasta splitter
Message-ID: <E17z13E-0004aH-00@mendel.bio.caltech.edu>


> > What did you have in mind as a naming convention for the 
> > output files? The 
> > existing code names each file after the first sequence, I 
> > guess you want 
> > "outfile.1" "outfile.2" and so on, possibly with leading zeroes 
> > "outfile.,001" etc.
> 
> My $.02: I think that outfile.[number_here] is not a good convention,
> since the extension (whatever you put after the dot) means the file
> type, and here the file type is always the same (ASCII text). I think it
> should be something like:
> outfile_[number].txt
> It should look like this:
> outfile_1.txt


I agree.  Also the numeric range should be displayed
in a fixed column width.  Ideally something like:

  % esplit \
     -sequence=ncbi_nr.nfa \
     -fmask='nr_frag_####.nfa' \
     -splitn=20 \
     -splitmode=cycle \
     -numberfrom=0

would produce

nr_frag_0000.nfa
...
nr_frag_0019.nfa

Keeping the names fixed width prevents all sorts of
text alignment problems which can show up otherwise.

Regards,


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From fernan at iib.unsam.edu.ar  Tue Oct  8 22:51:16 2002
From: fernan at iib.unsam.edu.ar (Fernan Aguero)
Date: Tue, 8 Oct 2002 19:51:16 -0300
Subject: fasta splitter
In-Reply-To: <3DA309D0.7@uk.lionbioscience.com>
References: <Pine.OSF.4.44.0210081607100.919344-100000@cbi1a> <3DA309D0.7@uk.lionbioscience.com>
Message-ID: <20021008225116.GA273@iib.unsam.edu.ar>

+----[ Thus spoke Peter Rice (peter.rice at uk.lionbioscience.com):
|

[ snipped ]

| 
| One problem doing this in EMBOSS is the need to generate filenames for your 
|  split files - but maybe a base filename would be enough to generate 
| names. 

Now let me get myself into the discussion. The splitter I use is
called 'shatter' and is part of the SEALS package, which I
guess is unmaintained (and perhaps obsolete?)  and is
basically perl. 
ftp://ftp.ncbi.nih.gov/pub/walker/seals/software

The following discussion works for splitting into individual
sequences, but not into groups of sequences. In this case a
different naming scheme should be used, (though perhaps the
same argument specifier '-word' could be used?). 

The approach of shatter (both for splitting FASTA files, but
also for splitting concatenated BLAST reports, which are
split by 'shatterblast') is to let you choose the 'word'
which will be used as a basename. Both shatters know about
the NCBI FASTA standard and thus, given a FASTA header like
the following:
>gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc

will take the gi as word 2 (123456), the accession number
(AA123456) as word 4, the accession.version (AA123456.1) as
word 5 and so on. 

In the command-line you just say 'shatter -word 1 fastafile'
if you want the first word after the '>' to be the basename.

This produces files with that basename and terminated in .fa

The program will consider whitespace and the character '|'
as word delimiters.
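
The word selection described above can be sketched in a few lines of
Python. This is not shatter's own code (shatter is Perl); header_word is a
hypothetical helper that only mirrors the stated rule: words are counted
from 1 after the '>' and are delimited by whitespace or '|'.

```python
import re


def header_word(header, word):
    """Return the 1-based word of a FASTA header, splitting on '|' and whitespace."""
    text = header[1:] if header.startswith(">") else header
    words = [w for w in re.split(r"[|\s]+", text) if w]
    return words[word - 1]


header = ">gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc"
# Word 2 is the gi number; appending ".fa" gives the output file name.
print(header_word(header, 2) + ".fa")
```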

In my own experience this is a good thing. I've used shatter
with many different FASTA flavours and adjusting the word to
be used as basename is plain easy.

BLAST reports are also trivial since query sequences are
also usually in FASTA format, and you get basically the same
header, though after the 'Query=' magic word. In this case
you get files with the same basename, but ending in .br

Just my 2 cents. Hope this makes it into EMBOSS.

Fernan


| and change the output file. You can add a command line option for the 
| number of sequences in an output file. Cleaning up output files for a rerun 
| is an exercise for the user (unless you want to invent a new ACD type that 
| does it :-)
| 
| Needs a modified version of the seqFileReopen function to handle the file 
| naming, but nothing complicated is involved.
| 
| regards.
| 
| Peter
| 
| -- 
| ------------------------------------------------
| Peter Rice, LION Bioscience Ltd, Cambridge, UK
| peter.rice at uk.lionbioscience.com +44 1223 224723
| 
| 
|
+----]

-- 
F e r n a n   A g u e r o
http://genoma.unsam.edu.ar/~fernan


From gwilliam at hgmp.mrc.ac.uk  Wed Oct  9 08:28:08 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Wed, 09 Oct 2002 09:28:08 +0100
Subject: eprimer3...broken pipe?
References: <B9C89A33.33C4%squiresb@macrogenics.com>
Message-ID: <3DA3E898.C9318CEE@hgmp.mrc.ac.uk>

From the eprimer3 documentation:

The Whitehead program must be set up and on the path in order for
eprimer3 to find and run it. 

The Whitehead Institute program that is run by this program is available
from:
http://www-genome.wi.mit.edu/genome_software/other/primer3.html 
(Then see the link 'Get release 0.9') 

The version that is run by this program is 3.0.9 currently available
from:
http://www-genome.wi.mit.edu/ftp/distribution/software/primer3_0_9_test.tar.gz 
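
A quick way to check the "must be on the path" requirement before running
eprimer3 is sketched below in Python. Only the program name primer3_core
comes from the error message quoted further down; find_program is an
illustrative helper, and the check uses the PATH of the process that runs
it.

```python
import shutil


def find_program(name="primer3_core"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


path = find_program()
if path is None:
    print("primer3_core is not on the PATH, so eprimer3 cannot run it")
else:
    print("primer3_core found at", path)
```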

Gary


Burke Squires wrote:
> 
> I have tried to install various version of emboss and when I try and run
> eprimer3 I get the following message:
> 
> [loopback:bioinfo/emboss-2.4.1/emboss] bsquires% eprimer3
> Picks PCR primers and hybridization oligos
> Input sequence(s): /bioinfo/fragments.fa
> Output file [tpe-v_a.eprimer3]: /bioinfo/fa.out
> 
>    EMBOSS An error in eprimer3.c at line 317:
> The program 'primer3_core' must be on the path.
> It is part of the 'primer3' package,
> available from the Whitehead Institute.
> See: http://www-genome.wi.mit.edu/
> Broken pipe
> 
> Does anybody know how to fix this?
> 
> Thanks!
> 
> Burke Squires
> 
> --
> Burke Squires
> Bioinformatics
> MacroGenics, Inc.
> 2600 Stemmons Freeway, Suite 210
> Dallas, TX 75235 USA
> Work: 214-634-3000 X224
> Squiresb @ macrogenics.com (Please remove spaces to use)
> www.macrogenics.com
> ----------------------------------------------------------------------------
> This e-mail and any attachments may be confidential or legally privileged.
> If you received this message in error or are not the intended recipient, you
> should destroy the e-mail message and any attachments or copies, and you are
> prohibited from retaining, distributing, disclosing or using any information
> contained herein.  Please inform us of the erroneous delivery by return
> e-mail.
> 
> Thank you for your cooperation.

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From Joerg.Schaber at uv.es  Wed Oct  9 17:02:51 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Wed, 09 Oct 2002 19:02:51 +0200
Subject: swissprot
Message-ID: <3DA4613B.3010901@uv.es>

Hi,

I can't load the Swiss-Prot bacteria database 
(ftp://ftp.ebi.ac.uk/pub/databases/swissprot/special_selections/bacteria.seq) 
into EMBOSS. I think EMBOSS is running well because I have no problem 
accessing the test databases (see showdb below). However, I think 
seqret is somehow using the wrong division file, although the PATH setting 
seems to be correct.

greetings,

joerg
 

 > dbiflat
Index a flat file database
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: SWISS
Database directory [.]:
Wildcard database filename [*.dat]: *.seq
Database name: swissbac
Release number [0.0]: 1.0
Index date [00/00/00]: 09/10/02

 > ll
insgesamt 132100
 950883 drwxrwxr-x    2 root     users        4096 Okt  9 18:50 .
 623533 drwxrwxr-x    5 jos      jos          4096 Okt  9 17:50 ..
 950889 -rw-r--r--    1 jos      jos        189028 Okt  9 18:50 acnum.hit
 950888 -rw-r--r--    1 jos      jos        660456 Okt  9 18:50 acnum.trg
 623548 -rw-r--r--    1 jos      jos      133412511 Okt  9 18:25 bacteria.seq
 950886 -rw-r--r--    1 jos      jos           322 Okt  9 18:50 division.lkp
 950887 -rw-r--r--    1 jos      jos        836840 Okt  9 18:50 entrynam.idx

 > showdb
Displays information on the currently available databases
# Name        Type ID  Qry All Comment
# ====        ==== ==  === === =======
swissbac      P    OK  OK  OK  SWISSPROT sequences of procaryotes 9/10/02
tpir          P    OK  OK  OK  PIR using NBRF access for 4 files
tsw           P    OK  OK  OK  Swissprot native format with EMBL CD-ROM index
tswnew        P    OK  OK  OK  Swissnew as 3 files in native format with EMBL CD-ROM index
twp           P    OK  OK  OK  EMBL new in native format with EMBL CD-ROM index
buch          N    OK  OK  OK  Buchnera database in DDBJ Format
fbuch         N    OK  OK  OK  Buchnera database in FASTA Format
tembl         N    OK  OK  OK  EMBL in native format with EMBL CD-ROM index
tgb           N    OK  -   -   Genbank IDs
tgenbank      N    OK  OK  OK  GenBank in native format with EMBL CD-ROM index

 > head bacteria.seq
ID   120K_RICRI     STANDARD;      PRT;  1300 AA.
AC   P14914;
--snipp

--snipp

 > seqret swissbac:120K_RICRI
Reads and writes (returns) sequences
Warning: Cannot open division file '<null>' for database 'swissbac'
Warning: seqCdQry failed
Error: Unable to read sequence 'swissbac:120K_RICRI'
 >

-- 
----------------------------------------------------------
Joerg Schaber
Instituto Cavanilles de Biodiversidad y Genetica Evolutiva
Universidad de Valencia               Tel.: ++34 96 398 3647
A.C. 22085                            Fax.: ++34 96 398 3670
46071 Valencia, España                email : jos at uv.es


From jweiner1 at ix.urz.uni-heidelberg.de  Thu Oct 10 09:17:51 2002
From: jweiner1 at ix.urz.uni-heidelberg.de (January Weiner 3)
Date: Thu, 10 Oct 2002 11:17:51 +0200 (METDST)
Subject: fasta splitter
In-Reply-To: <000d01c26ef9$f206f710$0a00a8c0@zeus>
Message-ID: <Pine.A41.4.42.0210101115070.37122-100000@aixterm1.urz.uni-heidelberg.de>

> This sounds great to me. Personally, I'd prefer not to have the leading
> zeros - just an incrementing ".[integer]" appended to the filename supplied.
> Makes shell manipulation easier.

Well, I'd prefer the former -- because it makes shell manipulation easier
:-)  If you stay with the leading 0's, then any listing will show the files
in the correct order, otherwise it will show "foo.1, foo.10, ...,
foo.100,...  foo.2, ..." etc.
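
The ordering problem is easy to reproduce. The short Python sketch below
sorts the same numbers as plain and as zero-padded suffixes, the way a
directory listing sorts names as text; the "foo" base name is the one used
above and the width of 3 is an arbitrary choice.

```python
numbers = (1, 2, 10, 100)

plain = ["foo.%d" % n for n in numbers]
padded = ["foo.%03d" % n for n in numbers]

# Names are compared character by character, so "foo.100" sorts before "foo.2".
print(sorted(plain))
# With a fixed width, text order and numeric order agree.
print(sorted(padded))
```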

j.


----)-\//-///-----------------------------------January-Weiner-3-------
"'Tis true, there's magic in the web of it." -- Shakespeare


From kenneth at geisshirt.dk  Mon Oct 14 11:02:15 2002
From: kenneth at geisshirt.dk (Kenneth Geisshirt)
Date: Mon, 14 Oct 2002 13:02:15 +0200 (CEST)
Subject: Splitting genbank
Message-ID: <Pine.LNX.4.44.0210141258100.361-100000@lithium>

Hi everyone

I recently joined the mailing list (after a couple of weeks usage of
EMBOSS) so I hope that my question isn't a FAQ.

I have a local copy of genbank, and I wish to split it into four
databases: one for humans, one for rats, one for mice and one for the
rest. The applications seqret and seqretsplit can help me with the first
three by specifying the organism in the usa, but how do I specify "not
human and not rat and not mouse"?

Thanks in advance
  Kneth

-- 
Kenneth Geisshirt, M.Sc., Ph.D.         http://kenneth.geisshirt.dk
Grøndals Parkvej 2A, 3. sal                    kenneth at geisshirt.dk
DK-2720 Vanløse                                     +45 38 87 78 38


From peter.rice at uk.lionbioscience.com  Mon Oct 14 11:27:34 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 12:27:34 +0100
Subject: Splitting genbank
References: <Pine.LNX.4.44.0210141258100.361-100000@lithium>
Message-ID: <3DAAAA26.1080707@uk.lionbioscience.com>

Kenneth Geisshirt wrote:
> I have a local copy of genbank, and I wish to split it into four
> databases: one for humans, one for rats, one for mice and one for the
> rest. The applications seqret and seqretsplit can help me with the first
> three by specifying the organism in the usa, but how do I specify "not
> human and not rat and not mouse"?

In EMBOSS ....

split the gbrod file into rat, mouse and other rodents (a simple perl 
script would do)

index and define GenBank

then define subsets using the same index files and exclude the ones you 
don't want using, for example:

exclude: "*pri* *rat* *mus*"

... in copies of your EMBOSS database definition for genbank.

EMBOSS simply checks the excluded files list when using the index files.
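
To see which files such an exclude list would leave in the "rest" subset,
the wildcard matching can be imitated in Python. This is only an
illustration: the file names are hypothetical (they assume gbrod has
already been split into rat, mouse and other files), and it is an
assumption that EMBOSS applies shell-style wildcards exactly as fnmatch
does.

```python
from fnmatch import fnmatchcase


def kept_files(filenames, exclude):
    """Return the file names that match none of the space-separated patterns."""
    patterns = exclude.split()
    return [f for f in filenames
            if not any(fnmatchcase(f, p) for p in patterns)]


# Hypothetical file names for a locally split GenBank release.
files = ["gbpri1.seq", "gbrod_rat.seq", "gbrod_mus.seq",
         "gbrod_other.seq", "gbbct1.seq"]
print(kept_files(files, "*pri* *rat* *mus*"))
```

Running the same check against the real file names before indexing helps
catch patterns that match more files than intended.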

regards,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From Joerg.Schaber at uv.es  Mon Oct 14 12:11:59 2002
From: Joerg.Schaber at uv.es (Joerg Schaber)
Date: Mon, 14 Oct 2002 14:11:59 +0200
Subject: other indices
Message-ID: <3DAAB48F.6080704@uv.es>

Hi,

dbiflat can index fields other than ID and accession number, such as 
sequence version (seqv), description (des), keywords and taxon. However, 
in the example databases that come with EMBOSS I found only field 
definitions like 'fields: "sv des org key"'. So do I access the 
additional indices (e.g. in seqret) via 'seqret-sv:\*', 
'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively? 
'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*'  did not work.

Greetings,

joerg


From gwilliam at hgmp.mrc.ac.uk  Mon Oct 14 12:18:20 2002
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Mon, 14 Oct 2002 13:18:20 +0100
Subject: other indices
References: <3DAAB48F.6080704@uv.es>
Message-ID: <3DAAB60C.ACF4111E@hgmp.mrc.ac.uk>

See:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/UniformSequenceAddress.html#keys

You append the 'sv', 'des', 'org', 'key', etc to the database name with
a '-' and to a file name with a ':', so:

with a database you use a command like:

seqret embl-des:fau


with a file you use a command like:

seqret filename:org:homo
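
The two forms can be put side by side in a small Python sketch. It only
builds the USA strings described above ('-' between database and field,
':' between file name and field); database_usa and file_usa are
illustrative names and nothing here calls EMBOSS.

```python
def database_usa(db, field, query):
    """USA for a field search in an indexed database: db-field:query."""
    return "%s-%s:%s" % (db, field, query)


def file_usa(filename, field, query):
    """USA for a field search in a sequence file: filename:field:query."""
    return "%s:%s:%s" % (filename, field, query)


print("seqret", database_usa("embl", "des", "fau"))
print("seqret", file_usa("filename", "org", "homo"))
```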


Gary

Joerg Schaber wrote:
> 
> Hi,
> 
> dbiflat allows to index other fields except id and accession number like
> sequence version (seqv), description (des), keywords and taxon. However,
> in the example databases that come with EMBOSS I found only field
> definitions like 'fields: "sv des org key"'. So do I access the
> additional indices (e.g. in seqret) via 'seqret-sv:\*',
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively?
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*'  did not work.
> 
> Greetings,
> 
> joerg

-- 
Gary Williams               Tel: +44 1223 494522  Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk            http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK


From peter.rice at uk.lionbioscience.com  Mon Oct 14 12:20:40 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Mon, 14 Oct 2002 13:20:40 +0100
Subject: other indices
References: <3DAAB48F.6080704@uv.es>
Message-ID: <3DAAB698.1080108@uk.lionbioscience.com>

Joerg Schaber wrote:

> dbiflat allows to index other fields except id and accession number like 
> sequence version (seqv), description (des), keywords and taxon. However, 
> in the example databases that come with EMBOSS I found only field 
> definitions like 'fields: "sv des org key"'. So do I access the 
> additional indices (e.g. in seqret) via 'seqret-sv:\*', 
> 'seqret-des:\*','seqret-org:\*','seqret-key:\*', respectively? 
> 'seqret-taxon:\*', 'seqret-seqv:\*', and 'seqret-keyword:\*'  did not work.

For a database called schaber

dbiflat -fields "acnum,seqvn,des,keyword,taxon"

In the emboss.default definition:

DB schaber  [ type: P format: swiss method: emblcd
   dir: /data/schaber
   indexdir: /data/schaber
   comment: "Flatfiles database, all fields indexed"
   fields: "sv des org key"
]

In EMBOSS programs, use the USA:

'schaber-sv:\*'
'schaber-des:\*'
'schaber-org:\*'
'schaber-key:\*'

The confusion comes because the database definition (and the USA syntax) 
uses the field names in common use (e.g. in SRS) and dbiflat uses the 
EMBLCD/Staden index file names that dbiflat will be writing.
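
The two sets of names can be lined up as in the Python sketch below. The
pairing (seqvn/sv, des/des, keyword/key, taxon/org) is inferred from the
two lists in this message and is an assumption, not something read from
the EMBOSS source; usa_fields is an illustrative helper.

```python
# dbiflat index name -> field name used in "fields:" and in the USA.
# Inferred pairing; treat it as an assumption.
INDEX_TO_FIELD = {
    "seqvn": "sv",
    "des": "des",
    "keyword": "key",
    "taxon": "org",
}


def usa_fields(dbiflat_fields):
    """Translate a dbiflat -fields value into the field names used in a USA."""
    names = [n.strip() for n in dbiflat_fields.split(",")]
    # acnum is not listed in the "fields:" line above, so it is skipped here.
    return [INDEX_TO_FIELD[n] for n in names if n in INDEX_TO_FIELD]


fields = usa_fields("acnum,seqvn,des,keyword,taxon")
print('fields: "%s"' % " ".join(fields))
for field in fields:
    print("schaber-%s:*" % field)
```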

regards,

Peter

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From jmuehlis at uni-muenster.de  Tue Oct 15 09:03:19 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:03:19 +0200
Subject: this format is not readable by seqret
Message-ID: <3DABD9D7.4101AA0D@uni-muenster.de>

Hello there,

my name is Jörg Mühlisch and I work in the Department of Pediatric
Hematology and Oncology at the University of Münster (Germany). As a
scientist I use EMBOSS on Linux.

So here is my first question:

I have a set of sequences in different formats. Before trying to index
them, I tested them for readability with seqret:

find ./ -name "*" -exec seqret -osf fasta {} ../Sequencesothers/{} \;

Some of my files are not readable and I do not know the name of their
format:

Contig 1 (1,506)
  Contig Length:                  506 bases
  Average Length/Sequence:        458 bases
  Total Sequence Length:         1375 bases
  Top Strand:                       3 sequences
  Bottom Strand:                    0 sequences
  Total:                            3 sequences
^^
AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...

Maybe there is a way to convert this format appropriately.

Thanks

Jörg Mühlisch
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jmuehlis.vcf
Type: text/x-vcard
Size: 339 bytes
Desc: Karte für Joerg Muehlisch
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20021015/6754baf0/attachment-0001.vcf>

From peter.rice at uk.lionbioscience.com  Tue Oct 15 09:16:32 2002
From: peter.rice at uk.lionbioscience.com (Peter Rice)
Date: Tue, 15 Oct 2002 10:16:32 +0100
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
Message-ID: <3DABDCF0.6090809@uk.lionbioscience.com>

Joerg Muehlisch wrote:

> Some of my files are not readable and I do not know the name of their
> format:
> 
> Contig 1 (1,506)
>   Contig Length:                  506 bases
>   Average Length/Sequence:        458 bases
>   Total Sequence Length:         1375 bases
>   Top Strand:                       3 sequences
>   Bottom Strand:                    0 sequences
>   Total:                            3 sequences
> ^^
> AAMSCWATAGGGCGAATTGGAGCTCCACCGCGGTGGCGGYCGC...
> 
> Maybe there is a way to convert this format appropriately.

Should be possible, if the format is common enough.

Where does the file come from? Does this program/package have an option to 
save in one of the (many) 'standard' formats?

regards,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723


From jmuehlis at uni-muenster.de  Tue Oct 15 09:44:28 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 11:44:28 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de> <3DABDCF0.6090809@uk.lionbioscience.com>
Message-ID: <3DABE37C.B045FF6E@uni-muenster.de>

Hi,

in fact I hoped that somebody on the list would know where this format
comes from. In my set of files I found just a few of these unreadable
sequences.
As it does not seem to be a well-known format, I will try to find out
where it is used.

Thanks

Jorg

Peter Rice wrote:

> Should be possible, if the format is common enough.
> 
> Where does the file come from? Does this program/package have an option to
> save in one of the (many) 'standard' formats?
> 
> regards,
> 
> Peter Rice
> 
> --
> ------------------------------------------------
> Peter Rice, LION Bioscience Ltd, Cambridge, UK
> peter.rice at uk.lionbioscience.com +44 1223 224723

From kdj at sanger.ac.uk  Tue Oct 15 10:38:46 2002
From: kdj at sanger.ac.uk (Keith James)
Date: 15 Oct 2002 11:38:46 +0100
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de>
	<3DABDCF0.6090809@uk.lionbioscience.com>
	<3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <sc4u1joasw9.fsf@deskpro69.internal.sanger.ac.uk>

>>>>> "Joerg" == Joerg Muehlisch <jmuehlis at uni-muenster.de> writes:

    Joerg> Hi, in fact I hoped that somebody on the list would know
    Joerg> where this format comes from. In my set of files I found
    Joerg> just a few of these unreadable sequences.  As it does not
    Joerg> seem to be a well-known format, I will try to find out
    Joerg> where it is used.

I _think_ this may be flatfile output from DNAStar/Lasergene. It's
been a while since I've seen any files like that but the ^^ delimiter
reminded me of it.

I don't have access to the package to verify this.

Keith

-- 

- Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -


From jrvalverde at cnb.uam.es  Tue Oct 15 12:09:12 2002
From: jrvalverde at cnb.uam.es (José R. Valverde)
Date: Tue, 15 Oct 2002 14:09:12 +0200
Subject: this format is not readable by seqret
In-Reply-To: <3DABE37C.B045FF6E@uni-muenster.de>
References: <3DABD9D7.4101AA0D@uni-muenster.de>
	<3DABDCF0.6090809@uk.lionbioscience.com>
	<3DABE37C.B045FF6E@uni-muenster.de>
Message-ID: <20021015140912.7294cd80.jrvalverde@cnb.uam.es>

On Tue, 15 Oct 2002 11:44:28 +0200
Joerg Muehlisch <jmuehlis at uni-muenster.de> wrote:

> Hi,
> 
> in fact I hoped that somebody on the list would know where this format
> comes from. In my set of files I found just a few of these unreadable
> sequences.
> As it does not seem to be a well-known format, I will try to find out
> where it is used.
> 
Maybe it would help if you were able to post a full file sample.
From the fragments you posted it looked like a sequencing project
file. It mentioned a contig size, with many gel readings of average
length and the orientation coverage of gels (+/- strands).

Iff the sequence contained (you only included a few bases) is just
the consensus, i.e. a single sequence of length exactly equal the
consensus length, then conversion should be trivial to any format.
Simply do a 'tail +8 {}' 

Otherwise it might contain the gel readings (and the consensus?),
and then it would be a multiple sequence file, possibly with gel
overlaps et al. and conversion may be a bit more difficult. It may
be also that more than one contig and associated files is included in
one file, making processing more difficult.

Initially I would expect the second choice to be true, from the header:
several short sequences making up a contig plus the consensus, in your
example, the first contig would be 506 bases, composed of three gels
of average length 458. Since 1375/3 = 458, I deduce that the consensus
sequence is not included. Therefore you have a multiple sequence file
of overlapping gel readings.

You may try this:

	1) find out if more than one contig is in the file
	2) find out how sequences are separated
	3) decide what you want to do with them, e.g.
		split the file at "^Contig " lines
		strip comment lines (^*:*$)
		split at sequence separators

see csplit(1) for details on how to do it on a pipeline. E.g.
assuming sequences are delimited by a blank line, this _might_
work:
	csplit -f contig. file '/^Contig /' '{*}'
	foreach i ( contig.* )
		tail +8 $i | csplit - /\
\
/ -f ${i}.gel
	end
(note that we need to escape newlines directly) and you'd get the raw 
sequences all right as contig.##.gel.##
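
As an alternative to the csplit pipeline, the same idea can be sketched in
Python. The sketch assumes, from the fragment posted earlier, that each
record starts with a "Contig " line, that the header ends at a line
containing only "^^", and that everything after it up to the next
"Contig " line is one sequence. If the file holds several gel readings per
contig, the splitting step would need to be extended. contigs_to_fasta is
an illustrative name.

```python
def contigs_to_fasta(text):
    """Convert 'Contig ... ^^ sequence' records into FASTA text."""
    records = []
    name, in_sequence, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("Contig "):
            # A new record begins: store the previous one, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line.split("(")[0].strip().replace(" ", "_")
            in_sequence, chunks = False, []
        elif line.strip() == "^^":
            # End of the statistics header; sequence lines follow.
            in_sequence = True
        elif in_sequence and line.strip():
            chunks.append(line.strip())
    if name is not None:
        records.append((name, "".join(chunks)))
    return "".join(">%s\n%s\n" % (n, s) for n, s in records)


sample = (
    "Contig 1 (1,506)\n"
    "  Contig Length:                  506 bases\n"
    "^^\n"
    "AAMSCWATAGG\n"
    "GCGAATTGG\n"
)
print(contigs_to_fasta(sample), end="")
```

The resulting FASTA text can then be saved to a file and passed to seqret
for conversion to any other format.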

				j


From jmuehlis at uni-muenster.de  Tue Oct 15 13:50:03 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 15 Oct 2002 15:50:03 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
		<3DABDCF0.6090809@uk.lionbioscience.com>
		<3DABE37C.B045FF6E@uni-muenster.de> <sc4u1joasw9.fsf@deskpro69.internal.sanger.ac.uk>
Message-ID: <3DAC1D0B.3ACD64FD@uni-muenster.de>

Keith James wrote:
Yes, I think that might be; I think our collaboration group is working
with DNAStar. But nevertheless there does not seem to be an EMBOSS way
to change the file format. So I will try it with Linux tools like tr.
Thanks for your help.

Jorg
> I _think_ this may be flatfile output from DNAStar/Lasergene. It's
> been a while since I've seen any files like that but the ^^ delimiter
> reminded me of it.
> 
> I don't have access to the package to verify this.
> 
> Keith
> 
> --
> 
> - Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
> - Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -

From gbottu at ben.vub.ac.be  Mon Oct 21 08:34:38 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 10:34:38 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210834.KAA1459646@black.vub.ac.be>

from : BEN

	Dear colleagues,
	
While doing some experimenting with fuzzpro, I tried the following :
-----------------
Input sequence(s): sw:pap?_carpa
Search pattern: <M(0,1)-A.
Number of mismatches [0]:
Output report [pap2_carpa.fuzzpro]:

   EMBOSS An error in embpat.c at line 725:
Unrecognised character in <M(0,1)-A
------------------
Yet I think I respected the PROSITE syntax. Does anyone have an idea?

	Guy Bottu


From gbottu at ben.vub.ac.be  Mon Oct 21 08:34:46 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 10:34:46 +0200 (CEST)
Subject: question about prophecy
Message-ID: <200210210834.KAA1459584@black.vub.ac.be>

from : BEN

	Dear colleagues,
	
I was looking at what the program prophecy is doing and I am puzzled. What is 
the difference between Gribskov and Henikoff profiles ? Both seem to have 
match/mismatch scores computed with the help of a scoring matrix as well as gap 
penalties. Furthermore, I thought that the Henikoffs made the Blocks databank 
using profiles without gaps. Can someone help me?

	Guy Bottu


From ableasby at hgmp.mrc.ac.uk  Mon Oct 21 09:11:57 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 21 Oct 2002 10:11:57 +0100 (BST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210911.KAA28060@bromine.hgmp.mrc.ac.uk>

Terminating full-stops are currently not part of the EMBOSS 
implementation of PROSITE patterns. Strictly they are,
although unnecessary, part of the PROSITE syntax so we
can accept them for future releases. For now if you just
omit the '.' the pattern will work.

Alan


From gbottu at ben.vub.ac.be  Mon Oct 21 09:36:05 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Mon, 21 Oct 2002 11:36:05 +0200 (CEST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210936.LAA1462808@black.vub.ac.be>

Without the '.' it does not give an error. I get : 
------------------
> fuzzpro
Protein pattern search
Input sequence(s): sw:pap?_carpa
Search pattern: <M(0,1)-A
Number of mismatches [0]:
Output report [pap2_carpa.fuzzpro]:
-------------------
the output file however turns out to be empty. Yet it should have found sw:papa_carpa, which 
starts with :   MAMI...
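
For comparison, the pattern can be translated into a regular expression
and tried on the first residues quoted above. This Python sketch is not
how fuzzpro works internally; it is a simplified translation of the usual
PROSITE notation ('<' and '>' as anchors, x as any residue, (n,m) as a
repeat range, {..} as "anything but"), and prosite_to_regex is an
illustrative name.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")            # the trailing full stop is optional
    regex = regex.replace("-", "")         # '-' only separates elements
    regex = regex.replace("{", "[^").replace("}", "]")   # {AB} = not A or B
    regex = regex.replace("(", "{").replace(")", "}")    # (0,1) = repeat range
    regex = regex.replace("x", ".")        # x = any residue
    return regex.replace("<", "^").replace(">", "$")     # sequence anchors


regex = prosite_to_regex("<M(0,1)-A.")
print(regex)
# "MAMI" are the first residues quoted above for sw:papa_carpa.
print(bool(re.match(regex, "MAMI")))
```

Under this reading the pattern matches a sequence starting with MA, and
also one starting directly with A, since the M may occur zero times.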

	Guy Bottu


From ableasby at hgmp.mrc.ac.uk  Mon Oct 21 09:50:21 2002
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Mon, 21 Oct 2002 10:50:21 +0100 (BST)
Subject: question about fuzzpro and PROSITE
Message-ID: <200210210950.g9L9oLt03933@sulphur.hgmp.mrc.ac.uk>

We'll look into that. Looks to be a boundary condition
affecting zero length N terminal ranges.

Thanks

Alan


From simon.andrews at bbsrc.ac.uk  Mon Oct 21 14:00:35 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 15:00:35 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28753@bi-exsrv1.iapc.bbsrc.ac.uk>

I'm having all sorts of problems working with the latest release of RefSeq, due to a change in the way the files are being laid out.

In older releases of RefSeq the LOCUS identifier was the same as the accession number (eg NM_0123456), but in the latest version the LOCUS identifier is the gene identifier, and these aren't unique in the database!!

This means that when I run dbiflat (even using -idformat REFSEQ) I get a load of warnings about duplicate entries and when I later try to use the database I find that a load of entries are inaccessible because of this.

For example accessions NM_134265,NM_134264 and NM_015626 all have the ID WSB1.

How can I get dbiflat to index with the accession number as its primary identifier so I don't lose entries when indexing them??
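
To find out how many entries are affected before indexing, the LOCUS and
ACCESSION lines of a GenBank-format file can be scanned with a few lines
of Python. This is a rough sketch that relies only on those two line
types; duplicate_locus_names is an illustrative name, and the sample lines
are cut down from real records to the two fields that matter.

```python
from collections import defaultdict


def duplicate_locus_names(lines):
    """Map each LOCUS name used by more than one entry to its accessions."""
    by_locus = defaultdict(list)
    locus = None
    for line in lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            locus = fields[1] if len(fields) > 1 else None
        elif line.startswith("ACCESSION") and locus is not None:
            fields = line.split()
            if len(fields) > 1:
                by_locus[locus].append(fields[1])
            locus = None
    return {name: accs for name, accs in by_locus.items() if len(accs) > 1}


sample = [
    "LOCUS       WSB1", "ACCESSION   NM_134265", "//",
    "LOCUS       WSB1", "ACCESSION   NM_134264", "//",
    "LOCUS       WSB1", "ACCESSION   NM_015626", "//",
]
print(duplicate_locus_names(sample))
```

For a real release, pass an open file object instead of the sample list,
since iterating over a file yields its lines.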

Thanks

Simon

PS This actually looks like a mistake by the RefSeq curators - I mean who thought that having a non-unique primary sequence identifier was a good idea!!!

--
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463 


From simon.andrews at bbsrc.ac.uk  Mon Oct 21 15:24:39 2002
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 21 Oct 2002 16:24:39 +0100
Subject: Indexing Refseq
Message-ID: <2DC41140A89ED411989D00508BDCD9ED01E28754@bi-exsrv1.iapc.bbsrc.ac.uk>

> -----Original Message-----
> From: simon andrews (BI) [mailto:simon.andrews at bbsrc.ac.uk]
> Subject: Indexing Refseq
> 
> 
> I'm having all sorts of problems working with the latest 
> release of RefSeq
>
> This means that when I run dbiflat (even using -idformat 
> REFSEQ) I get a load of warnings about duplicate entries and 
> when I later try to use the database I find that a load of 
> entries are inaccessible because of this.
> 
> For example accessions NM_134265,NM_134264 and NM_015626 all 
> have the ID WSB1.

Just to follow up to myself - I've found a temporary work-round for this problem.  The Bioperl script at the bottom of the message will pre-process the current Refseq files into a format which dbiflat can then index without errors.  You will see a warning from the NC_xxxx chromosome files in Refseq, but as these are only features with no sequence I wasn't too worried about them and just skipped them.

Usage of the script is "script_name [infile] > outfile".

	TTFN

	Simon.
-------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# This script is a filter through which we can
# pass the whole of refseq. Newer versions of
# refseq replaced their locus ID with a string
# which wasn't the accession number.  This
# just changes them back.

my ($filename) = @ARGV;

die "No filename given" unless ($filename);

my $in = Bio::SeqIO -> new(-file => $filename,
			      -format => 'genbank');

die "Couldn't read $filename" unless ($in);

my $out = Bio::SeqIO -> new(-fh => \*STDOUT,
			    -format => 'genbank');

die "Couldn't make output pipe" unless ($out);

while (my $seq = $in -> next_seq()){

  # Some NC_xxx seqs are in the Refseq file
  # but don't have any sequence attached. We'll
  # skip those files...

  next if ($seq -> accession =~ /^NC/);

  $seq -> display_id($seq-> accession());

  $out -> write_seq($seq);

}
#-------------------------------------------------------


From jmuehlis at uni-muenster.de  Tue Oct 22 08:06:55 2002
From: jmuehlis at uni-muenster.de (Joerg Muehlisch)
Date: Tue, 22 Oct 2002 10:06:55 +0200
Subject: this format is not readable by seqret
References: <3DABD9D7.4101AA0D@uni-muenster.de>
			<3DABDCF0.6090809@uk.lionbioscience.com>
			<3DABE37C.B045FF6E@uni-muenster.de> <sc4u1joasw9.fsf@deskpro69.internal.sanger.ac.uk> <3DAC1D0B.3ACD64FD@uni-muenster.de>
Message-ID: <3DB5071F.EEB21A1@uni-muenster.de>

Hi,

Just for your information. This is the answer from my collaborators:

The sequence is a DNAStar EditSeq file. The notation indicates that this
sequence is a consensus sequence from multiple reads put into a contig. If
you do not have DNAStar, try to open it with a word processor program and
cut and paste the sequence into whatever sequence editor you use. The
sequence uses standard nomenclature (i.e. W = A or T; M = A or C; etc.)

Thanks for your help.

As this format is not readable I will now just change the format by
other means.

Jorg

From Andres.Aeschlimann at id.unibe.ch  Tue Oct 22 15:23:31 2002
From: Andres.Aeschlimann at id.unibe.ch (Andres Aeschlimann)
Date: Tue, 22 Oct 2002 17:23:31 +0200 (MET DST)
Subject: Cannot connect!
Message-ID: <Pine.GSO.4.21.0210221655360.14608-100000@ubecx01>


Hi all

I have installed Jemboss for the first time, 
but one problem remains:

After launching Jemboss from
http://ubecx04.unibe.ch:8080/jemboss/Jemboss.jnlp (a trial campus EMBOSS
server)

the Web Start window appears as it should, and so does the login window,
where a username and password can be entered. Later on the window says

Cannot connect!

and a window "Check Public Server Settings" appears, showing the contents
of the jemboss.properties file:

user.auth=true
jemboss.server=true
server.public=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
server.private=https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter
service.public=JembossAuthServer
service.private=JembossAuthServer
plplot=/products/emboss/emboss/share/EMBOSS/
embossData=/products/emboss/emboss/share/EMBOSS/data/
embossBin=/products/emboss/emboss/bin/
embossPath=/usr/bin/:/bin:/packages/clustal/:/packages/primer3/bin:
acdDirToParse=/products/emboss/emboss/share/EMBOSS/acd/
embossURL=http://www.uk.embnet.org/Software/EMBOSS/Apps/

soap-2_3_1 and jakarta-tomcat-4.1.12 are installed as described, for use
with
ftp://ftp.hgmp.mrc.ac.uk/pub/EMBOSS/patchfiles/install-jemboss-server.sh
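
When checking settings like the ones listed above, it can help to pull the
server endpoint out of the properties text and see exactly which scheme,
host and port the client will try to contact. The following is a rough
Python sketch; parse_properties and endpoint are hypothetical helper names,
and the parser is deliberately simplified (it only handles key=value lines
and # or ! comments, not the full Java properties syntax).

```python
from urllib.parse import urlsplit


def parse_properties(text):
    """Parse simple key=value lines into a dict (simplified, see above)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def endpoint(props, key="server.public"):
    """Return (scheme, host, port) for the URL stored under key."""
    url = urlsplit(props[key])
    return url.scheme, url.hostname, url.port
```

With the listing above, endpoint(parse_properties(text)) gives the https
scheme, the host ubecx04.unibe.ch and port 8443, which are the values to
compare against what the Tomcat server is actually serving.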

rpcrouter listens on https://ubecx04.unibe.ch:8443/soap/servlet/rpcrouter:

SOAP RPC Router

Sorry, I don't speak via HTTP GET- you have to use HTTP POST to talk to me.


ubecx04:/products/emboss.222 % java -version
java version "1.4.0_00"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_00-b05)
Java HotSpot(TM) Client VM (build 1.4.0_00-b05, mixed mode)

on Solaris 9.

Is there any log file where the cause would be explained? 

Thanks in advance for any hint.

Res
=========================================================
Dr. Andres Aeschlimann     Andres.Aeschlimann at id.unibe.ch
University of Berne
Gesellschaftsstrasse 6
CH-3012 BERNE              tel: +41 31 631 3845
Switzerland                fax: +41 31 631 3865


From gbottu at ben.vub.ac.be  Thu Oct 24 14:20:43 2002
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 24 Oct 2002 16:20:43 +0200 (CEST)
Subject: questions about codon usage tables
Message-ID: <200210241420.QAA1196695@black.vub.ac.be>

from : BEN

	Dear colleagues,
	
I just took a look at codon usage tables under EMBOSS.

- There is a list of tables in .../share/EMBOSS/data/CODONS. Unfortunately,
they have rather cryptic names. Is there a way to find out which organism
each table is for? And which data source do they come from?

- There is a program cutgextract. I tried it:

> cutgextract
Extract data from CUTG
CUTG directory [.]: /db/cutg	(here is the file cutg.dat)

But it does ... nothing. 

Does anyone have a clue?

	Sincerely,
	Guy Bottu


From areagp61 at yahoo.it  Fri Oct 25 09:03:36 2002
From: areagp61 at yahoo.it (Graziano P.)
Date: Fri, 25 Oct 2002 11:03:36 +0200
Subject: -filter option for water and stretcher
Message-ID: <001e01c27c05$690c7520$18105709@italy.ibm.com>

Hi All,
I need to supply sequences on standard input. I found the -filter
qualifier in the -help -verbose output. For example, to use this
qualifier with "transeq" I write:
transeq -filter

Then I enter my sequence (in FASTA format, for example) by pasting or
typing it. When I have finished, I press CTRL-D to end standard input.
Finally the program writes its result to standard output.

I have tried to use the -filter qualifier with "water" and "stretcher".
These two programs require two input sequences, in different files.
If I enter the following on standard input:

>HTRE_ECOLI P33129 OUTER MEMBRANE USHER PROTEIN ...
PGVYDVSVYVNDQPIINQSITFVAIEGKKNAQACITLKNLLQFHINSPDINNEKAVLLAR
DETLGNCLNLTEIIPQASVRYDVNDQRLDIDVPQAWVMKNYQNYVDPSLWENGINAAMLS
NDQRLDIDVP

>YCJV_ECOLI P77481 HYPOTHETICAL ABC TRANSPORTER ...
MAQLSLQHIQKIYDNQVHVVKDFNLEIADKEFIVFVGPSGCGKSTTLRMIAGLEEISGGD
LLIDGKRMNDVPAKARNIAMVFQNYALYPHMTVYDNMAFGLKMQKIAKEVIDERVNWAAQ
KISVAELTGAEFMLYTTVGGTS

when I press CTRL-D I get the following error message:

Error: Unable to read sequence ''

How can I tell the program that what I paste or type on standard input
is two different sequences?
Is there a separator character that does this?
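
One possible workaround, sketched here in Python, is to split the pasted
FASTA text into its two records, write each to its own file, and pass the
files to water with -asequence and -bsequence. This sidesteps the question
rather than answering whether -filter can accept two sequences. The helper
names, the file names a.fasta and b.fasta, and the gap penalties of 10 and
0.5 are assumptions made for illustration.

```python
from pathlib import Path


def split_fasta(text):
    """Split FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def water_command(text, workdir, outfile="water.out"):
    """Write the two records to files and return a water command line."""
    records = split_fasta(text)
    if len(records) != 2:
        raise ValueError("expected exactly two FASTA records")
    paths = []
    for name, (header, seq) in zip(("a.fasta", "b.fasta"), records):
        path = Path(workdir) / name
        path.write_text(f">{header}\n{seq}\n")
        paths.append(str(path))
    # Gap penalties are illustrative values, not a recommendation.
    return ["water", "-asequence", paths[0], "-bsequence", paths[1],
            "-gapopen", "10", "-gapextend", "0.5", "-outfile", outfile]
```

The returned list can be handed to subprocess.run on a machine where EMBOSS
is installed; the same approach should work for stretcher by changing the
program name.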

Best regards
Graziano



From aralp001 at udcf.gla.ac.uk  Fri Oct 25 15:04:22 2002
From: aralp001 at udcf.gla.ac.uk (Dr Adam Ralph)
Date: Fri, 25 Oct 2002 16:04:22 +0100 (BST)
Subject: multi-page graphical output
In-Reply-To: <3DA4613B.3010901@uv.es>
Message-ID: <Pine.SOL.4.10.10210251549490.1853-100000@lenzie.cent.gla.ac.uk>


Dear Anyone,

   I am trying to write a program which outputs a graph, similar to
plotcon or cpgplot. It appears that, the way these programs are
constructed, the whole graph is plotted on one page. Thus if you have a
long sequence the graph looks a bit of a mess. Other graphical programs
(like prettyplot), which plot lines of text, are able to alter the number
of characters per line and produce multiple pages.
   My question is: can someone show me, or give me an example of, a
program which splits histogram/graph plots into multiple pages? Thus on
one page you would have a graph of residues 1-1000, then a graph of
1001-2000, etc.
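
To make the intended paging concrete, here is a small Python sketch of the
range arithmetic only (1-1000, 1001-2000, and so on). It does not use the
EMBOSS graphics library; page_ranges is a hypothetical helper, and each
returned range would still have to be plotted as its own page by whatever
plotting calls the program uses.

```python
def page_ranges(length, per_page=1000):
    """Return 1-based inclusive (start, end) residue ranges, one per page."""
    if length < 0 or per_page <= 0:
        raise ValueError("length must be >= 0 and per_page must be > 0")
    return [(start, min(start + per_page - 1, length))
            for start in range(1, length + 1, per_page)]
```

For a sequence of 2500 residues this gives three pages, the last one
covering residues 2001-2500.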

Thanks in advance
Adam


Dr. Adam Ralph
Institute of Virology
University of Glasgow
Church Street
Glasgow
G11 5JR

Phone: 0141 330 6268
Fax:   0141 337 2236 
email: a.ralph at vir.gla.ac.uk


From ggaz at cpqrr.fiocruz.br  Wed Oct  9 21:19:56 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:19:56 -0300
Subject: jemboss
Message-ID: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>

I would like to use the Jemboss program, but I need to enroll in HGMP and I don't know how to do this.
Could you help me?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

--
This message was scanned by MailScanner for viruses
and malicious code, and is believed to be clean.
IT Service - CPqRR/FIOCRUZ.


From ggaz at cpqrr.fiocruz.br  Wed Oct  9 21:15:32 2002
From: ggaz at cpqrr.fiocruz.br (Prof. Giovanni Gazzinelli)
Date: Wed, 9 Oct 2002 18:15:32 -0300
Subject: jemboss
Message-ID: <000801c26fd9$9e235fe0$6500a8c0@cpqrr.fiocruz.br>

I would like to use Jemboss (the Java interface for EMBOSS), but I need to enroll in HGMP and I don't know how to do this. Could you send me the email address where I can do this?
Thanks,
Solange Busek
Centro de Pesquisas René Rachou/FIOCRUZ

 

From tcarver at hgmp.mrc.ac.uk  Mon Oct 28 18:15:35 2002
From: tcarver at hgmp.mrc.ac.uk (Dr T. Carver)
Date: Mon, 28 Oct 2002 18:15:35 +0000 (GMT)
Subject: jemboss
In-Reply-To: <000901c26fd9$9e2b0100$6500a8c0@cpqrr.fiocruz.br>
Message-ID: <Pine.SOL.4.44.0210281813060.1860-100000@bromine>

Hi

You can register at the HGMP by filling out the form at:
http://www.hgmp.mrc.ac.uk/About/Registration/

Then send it to:
UK MRC HGMP Resource Centre
Hinxton
Cambridge
CB10 1SB
UK

You will then be sent an HGMP username and password.

Regards
Tim Carver

On Wed, 9 Oct 2002, Prof. Giovanni Gazzinelli wrote:

> I would like to use the jemboss program but I need to enroll in HGMP and I don't know how can I do this.
> Could you help me?
> Thanks,
> Solange Busek
> Centro de Pesquisas René Rachou/FIOCRUZ


From David.Lapointe at umassmed.edu  Mon Oct 28 22:21:55 2002
From: David.Lapointe at umassmed.edu (Lapointe, David)
Date: Mon, 28 Oct 2002 17:21:55 -0500
Subject: Emboss on Solaris.
Message-ID: <13B2F22F9D5DD611B07700508BB1E88F019A2D7A@edunivexch02.umassmed.edu>

We've moved to a Netra T1 and I am having problems with the PNG libraries.
I get these runtime errors when using png as output (postscript/X11 work
fine). The png.h is 1.2.4. What am I missing?


$ prettyplot
Displays aligned sequences, with colouring and boxing
Input sequence set: opsin.msf
Graph type [x11]: png
libpng warning: Application was compiled with png.h from libpng-1.0.6
libpng warning: Application  is  running with png.c from libpng-1.2.4
gd-png:  fatal libpng error: Incompatible libpng version in application and library
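
The two libpng warnings above name both versions involved: the png.h
header the application was compiled against and the libpng library it is
running with. As a rough illustration, this Python sketch extracts the two
versions from such output. The function names are hypothetical, and
comparing only the major.minor part is an assumption about what counts as
a mismatch.

```python
import re


def libpng_versions(output):
    """Return (compiled_with, running_with) from libpng warning text."""
    compiled = re.search(r"compiled with png\.h from libpng-(\S+)", output)
    running = re.search(r"running with png\.c from libpng-(\S+)", output)
    if not (compiled and running):
        return None  # the expected warning lines were not found
    return compiled.group(1), running.group(1)


def major_minor(version):
    """Reduce a version such as '1.2.4' to the tuple (1, 2)."""
    return tuple(int(part) for part in version.split(".")[:2])
```

If the two major.minor pairs differ, the program was most likely built
against an older png.h than the library found at run time.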

David Lapointe
Senior Informaticist / Information Services
Assistant Professor / Cell Biology
UMass Worcester
(508) 856-5141


From David.Bauer at SCHERING.DE  Tue Oct 29 06:37:00 2002
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 29 Oct 2002 07:37:00 +0100
Subject: Antwort: Emboss on Solaris.
Message-ID: <OF42A961B2.334D459D-ONC1256C61.0023C200@schering.de>


Hi,

I also had some problems with this on Solaris.
Did you try to run configure with "--with-pngdriver=DIR"?
This helps EMBOSS to pick the right header files.

David.




From shibl at seqbio.com  Wed Oct 30 16:13:08 2002
From: shibl at seqbio.com (Shibl Mourad)
Date: Wed, 30 Oct 2002 11:13:08 -0500
Subject: Emboss Expert System
Message-ID: <002c01c2802f$3fec6370$2602a8c0@SEQUENCE>

Dear EMBOSS user,

We are currently developing an expert system that will complement EMBOSS.
As there are roughly 200 tools packaged within EMBOSS alone, the task of
locating the 'right' tool, especially if you are a newcomer to the
bioinformatics field, can be overwhelming.

Our expert system, openExpert, aims to simulate the 'question and answer'
conversation one would have with a bioinformatics 'expert' -  but minus
their presence and wage.  Although it is currently populated with only the
EMBOSS suite, we aim to broaden the knowledge base of openExpert to
encompass all known bioinformatics tools.

We are looking for 5 EMBOSS users to review the system.  The review should
not take more than 30 minutes of your time and it would be of great value to
us.  If you are interested, please email shibl at seqbio.com.  If you would
like to try openExpert without providing a review, please indicate so in
your email and we will provide you with free access.

Help us make openExpert a valuable expert system for bioinformatics.


Thank you,

Shibl Mourad,
President
Sequence Bioinformatics


From newgene at bigfoot.com  Thu Oct 31 17:43:06 2002
From: newgene at bigfoot.com (clwu)
Date: Thu, 31 Oct 2002 11:43:06 -0600
Subject: emboss in cygwin
Message-ID: <3DC16BAA.1050201@bigfoot.com>

Hi, group,

I am new to the group. I tried to compile EMBOSS under win2K/cygwin but
failed. The EMBOSS website at HGMP mentions that
"Richard Bruskiewich and Simon Kelley at the Sanger Centre have
succeeded in compiling EMBOSS under Windows NT using the CygWin package.
The resulting executables have been tested but not thoroughly enough for
a release. Contact Richard Bruskiewich for more information." But I
cannot follow the link on that page to get help.

Has anyone succeeded with this? Are pre-compiled executables for cygwin
available, even for only some of the standalone programs? That would
help me a lot.

Thank you in advance.


clwu