From aengus.stewart at cancer.org.uk  Thu Mar  2 09:56:25 2006
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Thu, 02 Mar 2006 14:56:25 +0000
Subject: [EMBOSS] DB - finding out how many sequences
Message-ID: <44070799.6090804@cancer.org.uk>


Hi,

Do any of the EMBOSS apps output the number of sequences they have searched?

I am after this figure as I have a data library issue.

Some sequences are "not found" by EMBOSS even though I know they are in the original flat files.

I am trying to figure out whether this is a

configuration problem
data problem
indexing problem.

The indexing with dbiflat doesn't complain, but I would like to be able to check my input number of sequences against what EMBOSS thinks was output.


Cheers
Aengus


-- 
-----------------------------------------------------------------------
Aengus Stewart
Group Leader
Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential.  The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.


From jison at ebi.ac.uk  Thu Mar  2 11:35:35 2006
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 2 Mar 2006 16:35:35 -0000 (GMT)
Subject: [EMBOSS] DB - finding out how many sequences
In-Reply-To: <44070799.6090804@cancer.org.uk>
References: <44070799.6090804@cancer.org.uk>
Message-ID: <43713.172.31.100.168.1141317335.squirrel@webmail.ebi.ac.uk>

Ay up Aengus

Not so far as I'm aware although you could get that number
indirectly by using infoseq.

You could try using dbxflat too ... which does generate some
stats on the input data. I don't know whether the stats include
the number of sequences that were indexed, but it's worth a look.
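
For the input side of that comparison, the entries in a flat file can also be counted directly, independently of EMBOSS. The sketch below assumes EMBL/Swiss-Prot style flat files, in which each entry starts with a line beginning "ID" followed by three spaces and ends with a "//" line. The function name and the example data are made up for illustration.

```python
def count_flatfile_entries(lines):
    """Count entries in EMBL/Swiss-Prot style flat-file lines.

    Returns (id_lines, terminators). The two numbers should agree;
    a mismatch suggests a truncated or malformed entry.
    """
    id_lines = 0
    terminators = 0
    for line in lines:
        if line.startswith("ID   "):
            id_lines += 1
        elif line.rstrip() == "//":
            terminators += 1
    return id_lines, terminators


# Tiny in-memory example with two entries (made-up data).
example = [
    "ID   FOO_HUMAN   Reviewed;   14 AA.\n",
    "SQ   SEQUENCE   14 AA;\n",
    "     MATSCGLLKI IQRE\n",
    "//\n",
    "ID   BAR_HUMAN   Reviewed;   14 AA.\n",
    "SQ   SEQUENCE   14 AA;\n",
    "     MLTSCGLLKI IQRE\n",
    "//\n",
]
print(count_flatfile_entries(example))
```

For a real file, an open file handle can be passed instead of the list. The count can then be compared with the number of sequences that infoseq lists for the indexed database.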

Cheers

Jon




From pmr at ebi.ac.uk  Thu Mar  2 12:39:07 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 02 Mar 2006 17:39:07 +0000
Subject: [EMBOSS] DB - finding out how many sequences
In-Reply-To: <44070799.6090804@cancer.org.uk>
References: <44070799.6090804@cancer.org.uk>
Message-ID: <44072DBB.1060506@ebi.ac.uk>

Hi Aengus,

> Does any of the EMBOSS apps output the number of sequences that it has searched?
> 
> I am after this figure as I have a data library issue.
> 
> Some sequences are "not found" by EMBOSS even though I know they are in the original flat files.

What is your database definition?

regards,

Peter


From joanne at bioinformatics.ubc.ca  Mon Mar  6 17:51:00 2006
From: joanne at bioinformatics.ubc.ca (Joanne Fox)
Date: Mon, 06 Mar 2006 14:51:00 -0800
Subject: [EMBOSS] Warning: Cannot open division file '<null>' for database
	'swissprot'
Message-ID: <440CBCD4.8030605@bioinformatics.ubc.ca>

Hello EMBOSS community,

I used dbiflat to index the latest flatfile distribution of swissprot 
(uniprot_sprot.dat).  Now I am trying to use this database with the 
EMBOSS patmatdb program and I'm encountering an error that reads, 
"Warning: Cannot open division file '<null>' for database 'swissprot'".

I searched the mailing list archives and I see others with this same 
problem that boil down to permissions and/or path problems.  However, I 
still can't figure out what's going wrong on my system.  I've put more 
detailed information below.  I'm new to the world of configuring EMBOSS 
so if anyone has any ideas about what might be going wrong, I'd really 
appreciate the advice.

Thanks,
Joanne.
-- 
| Joanne Fox
| http://bioinformatics.ubc.ca/people/joanne


~> showdb
Displays information on the currently available databases
# Name        Type ID  Qry All Comment
# ====        ==== ==  === === =======
swissprot     P    OK  OK  OK  Swissprot Release 7.1, 2/21/2006

~> patmatdb
Search a protein sequence with a motif
Input sequence(s): swissprot
Warning: Cannot open division file '<null>' for database 'swissprot'
Error: Unable to read sequence 'swissprot'
Input sequence(s): swissprot:*
Warning: Cannot open division file '<null>' for database 'swissprot'
Error: Unable to read sequence 'swissprot:*'
Died: patmatdb terminated: Bad value for '-sequence' and no more retries

contents of .embossrc file:
set emboss_logfile /usr/local/software/bioinformatics/emboss/log/emboss.log
set emboss_database_dir /raid1/bioinformatics/data
DB swissprot [
    type: P
    method: emblcd
    format: swissprot
    dir: \$emboss_database_dir/swissprot/swissprot_V7_1
    file: "*.dat"
    release: "7.1"
    comment: "Swissprot Release 7.1, 2/21/2006"
]

contents of the /raid1/bioinformatics/data/swissprot/swissprot_V7_1/ 
directory:
-rw-r--r--    1 bin      bin       1165524 Mar  6 12:37 acnum.hit
-rw-r--r--    1 bin      bin       3899118 Mar  6 12:37 acnum.trg
-rw-r--r--    1 bin      bin           322 Mar  6 12:37 division.lkp
-rw-r--r--    1 bin      bin       4368405 Mar  6 12:37 entrynam.idx
-rw-r--r--    1 bin      bin      802445434 Mar  6 12:23 uniprot_sprot.dat
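
Since earlier reports of this warning came down to permissions or paths, one quick sanity check is whether the account running patmatdb can read the index files in the configured directory. The sketch below uses the file names from the listing above; the function name is hypothetical, and the check does not explain the '<null>' division name itself.

```python
import os

# File names taken from the directory listing above.
INDEX_FILES = ["division.lkp", "entrynam.idx", "acnum.hit", "acnum.trg"]


def check_index_dir(directory, names=INDEX_FILES):
    """Return a dict mapping each expected file to 'ok', 'missing' or 'unreadable'."""
    status = {}
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            status[name] = "missing"
        elif not os.access(path, os.R_OK):
            status[name] = "unreadable"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    # Run from inside the index directory, as the user who runs EMBOSS.
    for name, state in check_index_dir(".").items():
        print(f"{name}: {state}")
```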


From yezhiqiang at gmail.com  Mon Mar  6 16:29:49 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Tue, 7 Mar 2006 05:29:49 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein sequence?
Message-ID: <34198fe40603061329u85c9f95p@mail.gmail.com>

Dear all,

      Does EMBOSS have a handy way to mutate a protein sequence in a
specified way?
For example, I have a sequence foo.fasta

>foo
MATSCGLLKIIQRE

It has a mutant called 'A2L'. Is there any way to perform this operation
and output the following (with an option to check that foo.fasta has 'A'
at position 2):
>foo A2L
MLTSCGLLKIIQRE

My way: use extractseq to extract two files: one before position 2,
the other after position 2. Then create a FASTA file containing 'L'. After
that, I use union to join these 3 sequence files into one.

Or write a Perl script to do this by changing a substring of the string.

How nice it would be if EMBOSS could provide a 'mutate' program!

Thank you :)

Best regards!
--
Zhiqiang Ye


From Marc.Logghe at DEVGEN.com  Tue Mar  7 02:59:27 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Tue, 7 Mar 2006 08:59:27 +0100
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746B99@ANTARESIA.be.devgen.com>

Can msbar do something for you ? Msbar = "Mutate sequence beyond all
recognition"

Cheers,
Marc




From David.Bauer at SCHERING.DE  Tue Mar  7 01:40:24 2006
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 7 Mar 2006 07:40:24 +0100
Subject: [EMBOSS] Antwort: Does emboss have a handy way for mutate a protein
 sequence?
In-Reply-To: <34198fe40603061329u85c9f95p@mail.gmail.com>
Message-ID: <OFDFF9CC1E.C290B537-ONC125712A.00246F55-C125712A.0024A951@schering.de>


What about this solution:

cutseq foo.fasta -from 2 -to 2 | pasteseq -filter -pos 1 -bs asis:'L' |
descseq -filter -append -desc "A2L"

>foo A2L
MLTSCGLLKIIQRE
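
The pipeline works in two steps: cutseq deletes the residue at position 2, then pasteseq inserts the new residue after position 1. The same index arithmetic can be written out on plain strings as a small Python sketch. The helper names are made up here and only mimic the two steps; they are not EMBOSS code.

```python
def cut(seq, start, end):
    """Remove residues start..end (1-based, inclusive), like the cutseq step."""
    return seq[:start - 1] + seq[end:]


def paste(seq, insert, pos):
    """Insert `insert` after 1-based position `pos` (0 = before the first residue)."""
    return seq[:pos] + insert + seq[pos:]


wild_type = "MATSCGLLKIIQRE"
without_a2 = cut(wild_type, 2, 2)    # position 2 removed
mutant = paste(without_a2, "L", 1)   # 'L' inserted after position 1
print(mutant)
```

For a substitution at position N, the paste position is therefore always N - 1.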

Cheers,
David.




From yezhiqiang at gmail.com  Tue Mar  7 08:22:57 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Tue, 7 Mar 2006 21:22:57 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746B99@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA6746B99@ANTARESIA.be.devgen.com>
Message-ID: <34198fe40603070522r5d0920f9q@mail.gmail.com>

2006/3/7, Marc Logghe <Marc.Logghe at devgen.com>:
> Can msbar do something for you ? Msbar = "Mutate sequence beyond all
> recognition"

Thank you. I have checked msbar, it cannot do what I need.


--
Zhiqiang Ye


From pmr at ebi.ac.uk  Tue Mar  7 08:55:39 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 07 Mar 2006 13:55:39 +0000
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <34198fe40603061329u85c9f95p@mail.gmail.com>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
Message-ID: <440D90DB.6040605@ebi.ac.uk>

Zhiqiang Ye wrote:

> Dear all,
> 
>       Does emboss have a handy way for mutate a protein sequence by
> the specified way?
> For example, I have a sequence foo.fasta
> 
> 
> >foo
> MATSCGLLKIIQRE
> 
> It has a mutant called  'A2L'. Is there any way to do this operation
> to output(with an option to check the foo.fasta has 'A' at position
> 2):
> 
> >foo A2L
> MLTSCGLLKIIQRE

EMBOSS has several programs to change sequences. None does exactly what you ask.

You could look at:

biosed (does what you ask for longer replacements, but will change all 'A's to 
'L's.)

We could extend biosed to specify the position of the pattern ... is that what 
you need?

regards,

Peter


From gbottu at ben.vub.ac.be  Tue Mar  7 10:38:08 2006
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Tue, 7 Mar 2006 16:38:08 +0100
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence
In-Reply-To: <34198fe40603061329u85c9f95p@mail.gmail.com>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
Message-ID: <20060307153808.GA15947@bigben.ulb.ac.be>

Your solution works, as does the one proposed by David Bauer. Both are 
however rather tedious. It is much easier with an interactive sequence 
editor. There is MSE, not a standard EMBOSS program but distributed as 
Embassadir. It runs in a VT100 terminal ; it is a little bit 
intimidating for the novice user, but you can with some practice learn to 
use it. At the BEN site we have besides MSE also installed SeaView, a 
graphical mode editor (has versions for Windows, Macintosh and X-Window). 
These editors are of course only usable if you work locally on your own 
computer or in a terminal session on a remote computer. They will not work 
if you are using a Web interface for EMBOSS ... although some Web 
interfaces might have an applet mode editor that allows you to save the 
modified sequence back to the server (is there one in Jemboss?).

	Hope this helps,
	Guy Bottu,
	Belgian EMBnet Node


From yezhiqiang at gmail.com  Tue Mar  7 08:25:02 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Tue, 7 Mar 2006 21:25:02 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <OFDFF9CC1E.C290B537-ONC125712A.00246F55-C125712A.0024A951@schering.de>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
	<OFDFF9CC1E.C290B537-ONC125712A.00246F55-C125712A.0024A951@schering.de>
Message-ID: <34198fe40603070525w1ffa7155n@mail.gmail.com>

2006/3/7, David.Bauer at schering.de <David.Bauer at schering.de>:
> What about this solution:
>
> cutseq foo.fasta -from 2 -to 2 | pasteseq -filter -pos 1 -bs asis:'L' |
> descseq -filter -append -desc "A2L"

Thanks a lot.  It works very well!

Best

--
Zhiqiang Ye


From yezhiqiang at gmail.com  Tue Mar  7 11:33:40 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Wed, 8 Mar 2006 00:33:40 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <440D90DB.6040605@ebi.ac.uk>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
	<440D90DB.6040605@ebi.ac.uk>
Message-ID: <34198fe40603070833p627159b0i@mail.gmail.com>

2006/3/7, Peter Rice <pmr at ebi.ac.uk>:
>
> EMBOSS has several programs to change sequences. None does exactly what you ask.
>
> You could look at:
>
> biosed (does what you ask for longer replacements, but will change all 'A's to
> 'L's.)

 Yeah, it will change all 'A's to 'L's...

> We could extend biosed to specify the position of the pattern ... is that what
> you need?
>
 Yes! If biosed can be extended to do this, it will be better :)

Best Regards!

--
Zhiqiang Ye


From yezhiqiang at gmail.com  Tue Mar  7 11:42:00 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Wed, 8 Mar 2006 00:42:00 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence
In-Reply-To: <20060307153808.GA15947@bigben.ulb.ac.be>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
	<20060307153808.GA15947@bigben.ulb.ac.be>
Message-ID: <34198fe40603070842i15a34763s@mail.gmail.com>

hi, Guy Bottu

    Thank you. But I have to do a batch of these substitutions, so a
command-line solution will be better. I wrote an ugly shell script to
do this, based on David Bauer's suggestion.

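#!/bin/bash
# Usage: mutate.sh <fasta file> <mutation, e.g. A2L>
# Needs bash (not plain sh) for the ${var:offset:length} expansions.

mutation=$2
WT=${mutation:0:1}                  # wild-type residue (not checked here)
POS=${mutation:1:${#mutation}-2}    # 1-based position of the residue
MT=${mutation: -1}                  # mutant residue
POS2=$((POS - 1))                   # pasteseq inserts after this position

cat "$1" | cutseq -filter -from "$POS" -to "$POS" \
  | pasteseq -filter -pos "$POS2" -bs "asis:$MT" \
  | descseq -filter -append -desc " (mutant: $mutation )"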


With this script mutate.sh in my ~/bin, I can type this:
mutate.sh foo.fasta A2L
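
The script sets WT but never uses it, so it will mutate the sequence even if a different residue sits at that position. The check asked for at the start of the thread could look roughly like this outside EMBOSS. This is a sketch in Python rather than shell, it works on a bare sequence string, and the function name is hypothetical.

```python
import re


def apply_mutation(seq, mutation):
    """Apply a point mutation such as 'A2L' to `seq`, checking the wild-type residue."""
    match = re.fullmatch(r"([A-Za-z])(\d+)([A-Za-z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild = match.group(1).upper()
    pos = int(match.group(2))
    mutant = match.group(3).upper()
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence (length {len(seq)})")
    if seq[pos - 1].upper() != wild:
        raise ValueError(f"expected {wild} at position {pos}, found {seq[pos - 1]}")
    # Replace the single residue at the 1-based position.
    return seq[:pos - 1] + mutant + seq[pos:]


print(apply_mutation("MATSCGLLKIIQRE", "A2L"))
```

A mismatch such as "G2L" on the same sequence raises a ValueError instead of silently producing a wrong mutant, which is useful when running a whole batch.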

Best

--
Zhiqiang Ye


From Marc.Logghe at DEVGEN.com  Wed Mar  8 04:00:14 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed, 8 Mar 2006 10:00:14 +0100
Subject: [EMBOSS] Oddcomp behaves oddly ...
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746BA7@ANTARESIA.be.devgen.com>

... Or rather, how should I use it properly ?

OK, suppose you run compseq to obtain the frequency of individual
residues:
compseq tsw:Q62671 -word 1
Apparently this example protein sequence is rather rich in leucine (106
L out of 889).

In order to detect this leucine bias, a little file was created
(leu.comp) that had the following content:
<file leu.comp>
Word size       1
Total count     0

# bias should be detected as 106 > 100
L       100
</file leu.comp>

Oddcomp was run like this:
oddcomp tsw:Q62671 -infile leu.comp -window 889

But the sequence is not reported.
When I change the L count to 10 in leu.comp it does not work either.
Strangely enough, when the default window is taken (30) the sequence is
reported.
What is happening here ?

Regards,
Marc


From d.gatherer at vir.gla.ac.uk  Wed Mar  8 04:30:13 2006
From: d.gatherer at vir.gla.ac.uk (Derek Gatherer)
Date: Wed, 08 Mar 2006 09:30:13 +0000
Subject: [EMBOSS] clustalw vs. emma
Message-ID: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>

Morning all

Is there some unusual default being passed to emma?  For instance, 
here's emma with a vanilla set of parameters on a fairly well 
conserved set of proteins (bdlf4.fa):

yoda:cluscheck 157 > emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto

  CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: AG876-BDLF4      225 aa
Sequence 2: B95-BDLF4        225 aa
Sequence 3: GD1-BDLF4        225 aa
Sequence 4: RLV-BDLF4        238 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  100
Sequences (1:3) Aligned. Score:  98
Sequences (1:4) Aligned. Score:  85
Sequences (2:3) Aligned. Score:  98
Sequences (2:4) Aligned. Score:  85
Sequences (3:4) Aligned. Score:  86
Guide tree        file created:   [00029986C]
Start of Multiple Alignment
There are 3 groups
Aligning...
Group 1: Sequences:   2      Score:3770
Group 2: Sequences:   3      Score:3741
Group 3: Sequences:   4      Score:3462
Alignment Score 8058
GCG-Alignment file created      [00029986B]

and now clustalw on its own, not wrapped in emma, with the same input file

yoda:cluscheck 158 > clustalw bdlf4.fa

  CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence format is Pearson
Sequence 1: AG876-BDLF4      225 aa
Sequence 2: B95-BDLF4        225 aa
Sequence 3: GD1-BDLF4        225 aa
Sequence 4: RLV-BDLF4        238 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  100
Sequences (1:3) Aligned. Score:  98
Sequences (1:4) Aligned. Score:  88
Sequences (2:3) Aligned. Score:  98
Sequences (2:4) Aligned. Score:  88
Sequences (3:4) Aligned. Score:  88
Guide tree        file created:   [bdlf4.dnd]
Start of Multiple Alignment
There are 3 groups
Aligning...
Group 1: Sequences:   2      Score:4959
Group 2: Sequences:   3      Score:4928
Group 3: Sequences:   4      Score:4677
Alignment Score 8187
CLUSTAL-Alignment file created  [bdlf4.aln]

Why is the scoring subtly different? And see what it does to the 
N-terminus of the alignment....

First with emma:

            1                                               50
AG876-BDLF4 ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
B95-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
GD1-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
RLV-BDLF4   MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLP

now with clustalw:

AG876-BDLF4      MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
B95-BDLF4        MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
GD1-BDLF4        MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
RLV-BDLF4        MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLPESMASVFACW
                  ***:**:*              * ***..  **.********** *:*************

Clustalw alone clearly gives the correct alignment whereas emma is 
wrong.  I thought that emma simply wrapped clustalw for automation, 
but it appears it is doing something else.  Out of a set of 80 
proteins I am trying to pipeline through alignment, emma gives a 
variant result for 7 of them.....

Any thoughts, as always, much appreciated

cheers
Derek


From Marc.Logghe at DEVGEN.com  Wed Mar  8 05:36:56 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed, 8 Mar 2006 11:36:56 +0100
Subject: [EMBOSS] Oddcomp behaves oddly ...
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746BAA@ANTARESIA.be.devgen.com>

> Basically what is happening is that there is a check for the 
> length of the sequence being shorter than the window.  It may 
> well be this that is giving the problem. 

This was a perfect diagnosis. It works fine when I make the window size
one less than the sequence length.
But I guess it should not be a problem for oddcomp if the window size is
equal to (or even larger than) the length of the sequence? It is a way of
saying: don't bother with window sizes, just take the complete thing.
That could be a nice-to-have feature.
Thanks David,
Marc


From david at compbio.dundee.ac.uk  Wed Mar  8 05:26:23 2006
From: david at compbio.dundee.ac.uk (David Martin)
Date: Wed, 08 Mar 2006 10:26:23 +0000
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746BA7@ANTARESIA.be.devgen.com>
Message-ID: <C03461CF.1A4BD%david@compbio.dundee.ac.uk>

On 8/3/06 9:00 am, "Marc Logghe" <Marc.Logghe at devgen.com> wrote:

> ... Or rather, how should I use it properly ?
> 
> OK, suppose your run compseq to obtain the frequency for individual
> residues:
> compseq tsw:Q62671 -word 1
> Apparently this example protein sequence is rather rich in leucine (106
> L out of 889).
> 
> In order to detect this leucine bias, a little file was created
> (leu.comp) that had the following content:
> <file leu.comp>
> Word size       1
> Total count     0
> 
> # bias should be detected as 106 > 100
> L       100
> </file leu.comp>
> 
> Oddcomp was run like this:
> oddcomp tsw:Q62671 -infile leu.comp -window 889

Try window 888 (ie shorter than the length of the sequence). There are a
couple of minor bugs in the oddcomp code that I will forward to the team.

Basically what is happening is that there is a check for the length of the
sequence being shorter than the window.  It may well be this that is giving
the problem. 

It is a long time since I wrote this and C is not my usual language so
apologies if this is not a comprehensive answer.

..d

> 
> But the sequece is not reported.
> When I change the L count to 10 in leu.comp it does not work neither.
> Strangely enough, when the default window is taken (30) the sequence is
> reported.
> What is happening here ?
> 
> Regards,
> Marc
> 


From jison at ebi.ac.uk  Wed Mar  8 06:05:33 2006
From: jison at ebi.ac.uk (Jon Ison)
Date: Wed, 8 Mar 2006 11:05:33 -0000 (GMT)
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746BAA@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA6746BAA@ANTARESIA.be.devgen.com>
Message-ID: <56257.84.92.187.247.1141815933.squirrel@webmail.ebi.ac.uk>

Hi Marc

What might be cleaner is if we modify the ACD file so that any window
size bigger than the sequence length is reprompted for.

We could also add a qualifier to set the window to the sequence length, if
that'd help.

Cheers

Jon


>> Basically what is happening is that there is a check for the
>> length of the sequence being shorter than the window.  It may
>> well be this that is giving the problem.
>
> This was a perfect diagnosis. It works fine when I make the window size
> off one.
> But I guess it should not be a problem for oddcomp being the window size
> equal (or even larger) to the length of the sequence ? It is a way of
> saying: don't bother with window sizes, just take the complete thing.
> Could be a nice to have feature.
> Thanks David,
> Marc
>
>


From Marc.Logghe at DEVGEN.com  Wed Mar  8 06:36:06 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed, 8 Mar 2006 12:36:06 +0100
Subject: [EMBOSS] Oddcomp behaves oddly ...
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746BAB@ANTARESIA.be.devgen.com>

Hi David,
I am afraid there are some remaining oddities with oddcomp.
Tried another protein, other residue.
<file compseq.data>
Word size       1
Total count     0

S       4
</file compseq.data>

First a set of sequences is generated (kind of mimicking sliding window)
of length 20:
splitter wormpep:ZK822.4 -size 20 -overlap 19 > split.fa

Second, oddcomp is run (with the window size one less than the sequence length):
oddcomp split.fa -window 19 -infile compseq.data 
#
# Output from 'oddcomp'
#
# The Expected frequencies are taken from the file: compseq.data
#
#       Word size: 1
        ZK822.4_36-55
        ZK822.4_37-56
        ZK822.4_38-57
        ZK822.4_39-58
        ZK822.4_40-59
        ZK822.4_41-60

#       END     #

The first 20mer:
>ZK822.4_36-55
SAGSSGSNFLSGLQNSSFGQ

It is clear that there are 7 S residues in this stretch and we were
looking for 4 or more, so that makes sense.
However, when you run oddcomp again with an S count of 5 instead of 4, no
sequence is reported!
Cheers,
Marc




From pmr at ebi.ac.uk  Wed Mar  8 07:09:25 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 08 Mar 2006 12:09:25 +0000
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <C03461CF.1A4BD%david@compbio.dundee.ac.uk>
References: <C03461CF.1A4BD%david@compbio.dundee.ac.uk>
Message-ID: <440EC975.6090907@ebi.ac.uk>

David Martin wrote:

> Basically what is happening is that there is a check for the length of the
> sequence being shorter than the window.  It may well be this that is giving
> the problem. 

Not that part - it accepts a window the same length as the sequence (oddcomp 
can read more than one sequence, and does have to skip those too short to fit 
a window).

A later loop does fail if the window size matches the sequence - I am testing 
allowing it to run just one more time :-)

> It is a long time since I wrote this and C is not my usual language so
> apologies if this is not a comprehensive answer.

Do you speak Fortran?

>>But the sequece is not reported.
>>When I change the L count to 10 in leu.comp it does not work neither.
>>Strangely enough, when the default window is taken (30) the sequence is
>>reported.

Same problem I believe - it is the window size matching sequence length that 
stops the last for loop from checking anything.
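
The effect of such a loop bound can be illustrated with a small sliding-window count. This is not the oddcomp source, only a Python sketch that assumes the bug is of the usual off-by-one kind; the function name is made up, and the sequence is the 20-mer ZK822.4_36-55 quoted elsewhere in this thread.

```python
def max_count_in_windows(seq, residue, window):
    """Return the highest count of `residue` in any window of length `window`.

    Returns None if the sequence is too short to hold a single window.
    """
    if len(seq) < window:
        return None
    # There are len(seq) - window + 1 window start positions. Dropping the
    # "+ 1" leaves zero iterations when window == len(seq).
    n_windows = len(seq) - window + 1
    return max(seq[i:i + window].count(residue) for i in range(n_windows))


seq = "SAGSSGSNFLSGLQNSSFGQ"
print(max_count_in_windows(seq, "S", 20))  # whole sequence as one window
print(max_count_in_windows(seq, "S", 19))  # window one shorter than the sequence
```

With the inclusive bound, a window equal to the sequence length is simply the case of exactly one window.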

regards,

Peter


From pmr at ebi.ac.uk  Wed Mar  8 08:13:24 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 08 Mar 2006 13:13:24 +0000
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746BAB@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA6746BAB@ANTARESIA.be.devgen.com>
Message-ID: <440ED874.7070100@ebi.ac.uk>

Marc Logghe wrote:

> Hi David,
> I am afraid there are some remaining oddities with oddcomp.
> The first 20mer:
> 
> >ZK822.4_36-55
> SAGSSGSNFLSGLQNSSFGQ
> 
> It is clear that there are 7 S residues in this stretch and we were
> looking for 4 or more, so that makes sense.
> However, when you run oddseq again with S count of 5 instead of 4, no
> sequence is reported !

At least 2 bugs here. Firstly, with more than one sequence as input, some 
internal values were not fully reset. Also the word size is used (as 2) before 
it is set to 1.

For 8 Serines in this set I am still only getting one hit out of two. A little 
more investigation needed ... I am getting closer :-)

regards,

Peter


From ajb at ebi.ac.uk  Thu Mar  9 10:58:33 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 9 Mar 2006 15:58:33 -0000 (GMT)
Subject: [EMBOSS] clustalw vs. emma
In-Reply-To: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>
References: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>
Message-ID: <45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>

Hi Derek,

emma is indeed just a wrapper for clustalw. You can see what default
parameters it is using by specifying -debug on the command line
and then looking at the emma.dbg file. Search for a line
saying "Executing 'clustalw"

I suspect that the default gap extension penalty is rather high
in your case. If you use (e.g.) -gapext 0.2   then you'll get
something approaching the default clustalw behaviour. The defaults
for your sequences seem to be:

  -gapopen=10.000 -gapext=5.000 -gapdist=8
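
To see why the extension penalty matters so much here, consider the cost of the single 13-residue gap that the plain clustalw alignment opens. The sketch below assumes a simple affine model (opening penalty plus extension penalty times gap length). ClustalW adjusts its penalties further depending on position and residues, so the numbers are only illustrative, and the function name is made up.

```python
def affine_gap_cost(length, gap_open, gap_ext):
    """Cost of one gap of `length` residues under a simple affine model."""
    if length <= 0:
        return 0.0
    return gap_open + gap_ext * length


gap_length = 13  # RLV-BDLF4 is 238 aa, the other three are 225 aa
for gap_ext in (5.0, 0.2):
    cost = affine_gap_cost(gap_length, gap_open=10.0, gap_ext=gap_ext)
    print(f"gapext={gap_ext}: cost of a {gap_length}-residue gap = {cost:.1f}")
```

Under this model the internal gap is several times more expensive with -gapext=5.000 than with 0.2, which would be consistent with emma pushing the gap out to the N-terminus.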


HTH

Alan

> Morning all
>
> Is there some unusual default being passed to emma?  For instance,
> here's emma with a vanilla set of parameters on a fairly well
> conserved set of proteins (bdlf4.fa):
>
> yoda:cluscheck 157 > emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto
>
>   CLUSTAL W (1.83) Multiple Sequence Alignments
>
> Sequence type explicitly set to Protein
> Sequence format is Pearson
> Sequence 1: AG876-BDLF4      225 aa
> Sequence 2: B95-BDLF4        225 aa
> Sequence 3: GD1-BDLF4        225 aa
> Sequence 4: RLV-BDLF4        238 aa
> Start of Pairwise alignments
> Aligning...
> Sequences (1:2) Aligned. Score:  100
> Sequences (1:3) Aligned. Score:  98
> Sequences (1:4) Aligned. Score:  85
> Sequences (2:3) Aligned. Score:  98
> Sequences (2:4) Aligned. Score:  85
> Sequences (3:4) Aligned. Score:  86
> Guide tree        file created:   [00029986C]
> Start of Multiple Alignment
> There are 3 groups
> Aligning...
> Group 1: Sequences:   2      Score:3770
> Group 2: Sequences:   3      Score:3741
> Group 3: Sequences:   4      Score:3462
> Alignment Score 8058
> GCG-Alignment file created      [00029986B]
>
> and now clustalw, unwrapped in emma, with the same input file
>
> yoda:cluscheck 158 > clustalw bdlf4.fa
>
>   CLUSTAL W (1.83) Multiple Sequence Alignments
>
> Sequence format is Pearson
> Sequence 1: AG876-BDLF4      225 aa
> Sequence 2: B95-BDLF4        225 aa
> Sequence 3: GD1-BDLF4        225 aa
> Sequence 4: RLV-BDLF4        238 aa
> Start of Pairwise alignments
> Aligning...
> Sequences (1:2) Aligned. Score:  100
> Sequences (1:3) Aligned. Score:  98
> Sequences (1:4) Aligned. Score:  88
> Sequences (2:3) Aligned. Score:  98
> Sequences (2:4) Aligned. Score:  88
> Sequences (3:4) Aligned. Score:  88
> Guide tree        file created:   [bdlf4.dnd]
> Start of Multiple Alignment
> There are 3 groups
> Aligning...
> Group 1: Sequences:   2      Score:4959
> Group 2: Sequences:   3      Score:4928
> Group 3: Sequences:   4      Score:4677
> Alignment Score 8187
> CLUSTAL-Alignment file created  [bdlf4.aln]
>
> Why is the scoring subtly different?  and see what it does to the
> N-terminal of the alignment....
>
> First with emma:
>
>             1                                               50
> AG876-BDLF4 ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
> B95-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
> GD1-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
> RLV-BDLF4   MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLP
>
> now with clustalw:
>
> AG876-BDLF4
> MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
> B95-BDLF4
> MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
> GD1-BDLF4
> MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
> RLV-BDLF4
> MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLPESMASVFACW
>                   ***:**:*              * ***..  **.**********
> *:*************
>
> Clustalw alone clearly gives the correct alignment whereas emma is
> wrong.  I thought that emma simply wrapped clustalw for automation,
> but it appears it is doing something else.  Out of a set of 80
> proteins I am trying to pipeline through alignment, emma gives a
> variant result for 7 of them.....
>
> Any thoughts, as always, much appreciated
>
> cheers
> Derek
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at emboss.open-bio.org
> http://newportal.open-bio.org/mailman/listinfo/emboss
>


From d.gatherer at vir.gla.ac.uk  Thu Mar  9 11:18:55 2006
From: d.gatherer at vir.gla.ac.uk (Derek Gatherer)
Date: Thu, 09 Mar 2006 16:18:55 +0000
Subject: [EMBOSS] clustalw vs. emma
In-Reply-To: <45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>
References: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>
	<45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>
Message-ID: <6.2.3.4.1.20060309160317.02abb870@lenzie.gla.ac.uk>

Thanks Alan

That indeed is the cause of the problem:

Executing 'clustalw -infile=00052348A -outfile=00052348B -align
-type=protein -output=gcg -pwmatrix=blosum -pwgapopen=10.000
-pwgapext=0.100 -newtree=00052348C -matrix=blosum -gapopen=10.000
-gapext=5.000 -gapdist=8 -hgapresidues=GPSNDQEKR -maxdiv=30'
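
For anyone who wants to check these defaults across many runs, the lookup can be scripted. The following is only a minimal Python sketch: it assumes the debug file contains a line of the form quoted above ("Executing 'clustalw ...'", closed by a single quote), and the function name clustalw_options is made up for illustration.

```python
import re


def clustalw_options(debug_text):
    """Return the options of the first "Executing 'clustalw ...'" command.

    Assumes the command is wrapped in single quotes, as in the line
    quoted in this thread. Options without '=' map to an empty string.
    """
    match = re.search(r"Executing '(clustalw[^']*)'", debug_text)
    if match is None:
        return {}
    options = {}
    # Skip the program name itself, then split each "-name=value" token.
    for token in match.group(1).split()[1:]:
        name, _, value = token.lstrip("-").partition("=")
        options[name] = value
    return options


# Sample text modelled on the line from emma.dbg shown above.
sample = (
    "Executing 'clustalw -infile=00052348A -outfile=00052348B -align "
    "-type=protein -output=gcg -gapopen=10.000 -gapext=5.000 -gapdist=8'"
)
for name, value in clustalw_options(sample).items():
    print(name, value)
```

In practice the text would come from reading emma.dbg; comparing the gapopen, gapext and gapdist values against the clustalw defaults shows where the two runs diverge.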

However, on attempting to manually specify it, I run into another one:

[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto -debug -pwgapextend 5
Died: Unknown qualifier -pwgapextend

In the docs http://emboss.sourceforge.net/apps/cvs/emma.html, there 
are quite a few optional parameters of this sort, some of which work 
and others don't, eg:

[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto -debug -gapextend 5
Died: Unknown qualifier -gapextend
[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto -debug -pwgapextend 5
Died: Unknown qualifier -pwgapextend
[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto -debug -gapopen 5
Died: Unknown qualifier -gapopen
[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto -debug -gapdist 5

CLUSTAL W (1.83) Multiple Sequence Alignments

so -gapdist works at least.

Cheers
Derek


At 15:58 09/03/2006, ajb at ebi.ac.uk wrote:
>Hi Derek,
>
>emma is indeed just a wrapper for clustalw. You can see what default
>parameters it is using by specifying -debug on the command line
>and then looking at the emma.dbg file. Search for a line
>saying "Executing 'clustalw"
>
>I suspect that the default gap extension penalty is rather high
>in your case. If you use (e.g.) -gapext 0.2   then you'll get
>something approaching the default clustalw behaviour. The defaults
>for your sequences seem to be:
>
>   -gapopen=10.000 -gapext=5.000 -gapdist=8
>
>
>HTH
>
>Alan
>
> > Morning all
> >
> > Is there some unusual default being passed to emma?  For instance,
> > here's emma with a vanilla set of parameters on a fairly well
> > conserved set of proteins (bdlf4.fa):
> >
> > yoda:cluscheck 157 > emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto
> >
> >   CLUSTAL W (1.83) Multiple Sequence Alignments
> >
> > Sequence type explicitly set to Protein
> > Sequence format is Pearson
> > Sequence 1: AG876-BDLF4      225 aa
> > Sequence 2: B95-BDLF4        225 aa
> > Sequence 3: GD1-BDLF4        225 aa
> > Sequence 4: RLV-BDLF4        238 aa
> > Start of Pairwise alignments
> > Aligning...
> > Sequences (1:2) Aligned. Score:  100
> > Sequences (1:3) Aligned. Score:  98
> > Sequences (1:4) Aligned. Score:  85
> > Sequences (2:3) Aligned. Score:  98
> > Sequences (2:4) Aligned. Score:  85
> > Sequences (3:4) Aligned. Score:  86
> > Guide tree        file created:   [00029986C]
> > Start of Multiple Alignment
> > There are 3 groups
> > Aligning...
> > Group 1: Sequences:   2      Score:3770
> > Group 2: Sequences:   3      Score:3741
> > Group 3: Sequences:   4      Score:3462
> > Alignment Score 8058
> > GCG-Alignment file created      [00029986B]
> >
> > and now clustalw, unwrapped in emma, with the same input file
> >
> > yoda:cluscheck 158 > clustalw bdlf4.fa
> >
> >   CLUSTAL W (1.83) Multiple Sequence Alignments
> >
> > Sequence format is Pearson
> > Sequence 1: AG876-BDLF4      225 aa
> > Sequence 2: B95-BDLF4        225 aa
> > Sequence 3: GD1-BDLF4        225 aa
> > Sequence 4: RLV-BDLF4        238 aa
> > Start of Pairwise alignments
> > Aligning...
> > Sequences (1:2) Aligned. Score:  100
> > Sequences (1:3) Aligned. Score:  98
> > Sequences (1:4) Aligned. Score:  88
> > Sequences (2:3) Aligned. Score:  98
> > Sequences (2:4) Aligned. Score:  88
> > Sequences (3:4) Aligned. Score:  88
> > Guide tree        file created:   [bdlf4.dnd]
> > Start of Multiple Alignment
> > There are 3 groups
> > Aligning...
> > Group 1: Sequences:   2      Score:4959
> > Group 2: Sequences:   3      Score:4928
> > Group 3: Sequences:   4      Score:4677
> > Alignment Score 8187
> > CLUSTAL-Alignment file created  [bdlf4.aln]
> >
> > Why is the scoring subtly different?  and see what it does to the
> > N-terminal of the alignment....
> >
> > First with emma:
> >
> >             1                                               50
> > AG876-BDLF4 ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
> > B95-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
> > GD1-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
> > RLV-BDLF4   MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLP
> >
> > now with clustalw:
> >
> > AG876-BDLF4
> > MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
> > B95-BDLF4
> > MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
> > GD1-BDLF4
> > MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
> > RLV-BDLF4
> > MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLPESMASVFACW
> >                   ***:**:*              * ***..  **.**********
> > *:*************
> >
> > Clustalw alone clearly gives the correct alignment whereas emma is
> > wrong.  I thought that emma simply wrapped clustalw for automation,
> > but it appears it is doing something else.  Out of a set of 80
> > proteins I am trying to pipeline through alignment, emma gives a
> > variant result for 7 of them.....
> >
> > Any thoughts, as always, much appreciated
> >
> > cheers
> > Derek
> > _______________________________________________
> > EMBOSS mailing list
> > EMBOSS at emboss.open-bio.org
> > http://newportal.open-bio.org/mailman/listinfo/emboss
> >


From pmr at ebi.ac.uk  Thu Mar  9 12:01:15 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 09 Mar 2006 17:01:15 +0000
Subject: [EMBOSS] clustalw vs. emma
In-Reply-To: <6.2.3.4.1.20060309160317.02abb870@lenzie.gla.ac.uk>
References: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>	<45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>
	<6.2.3.4.1.20060309160317.02abb870@lenzie.gla.ac.uk>
Message-ID: <44105F5B.3050200@ebi.ac.uk>

Derek Gatherer wrote:

> In the docs http://emboss.sourceforge.net/apps/cvs/emma.html, there 
> are quite a few optional parameters of this sort, some of which work 
> and others don't, eg:

Yup - we're putting that right (some people have noticed the application docs 
are moving around).

The emboss.sf.net website only documents things for the latest code in CVS. We 
are adding documentation for release 3.0.0 (that is why the new directories 
are appearing).

The release 3.0.0 documentation is installed on your system when you install 
3.0.0 - if you install to /usr/local/bin it will be in:

/usr/local/share/EMBOSS/doc/programs/html (this will change in release 4.0.0).

You are seeing some of the changes made to make standard names for command 
line qualifiers since 3.0.0

Hope that helps,

Peter


From blanchard at microbio.umass.edu  Thu Mar  9 16:18:55 2006
From: blanchard at microbio.umass.edu (Jeffrey Blanchard)
Date: Thu, 9 Mar 2006 16:18:55 -0500
Subject: [EMBOSS] d_ino
Message-ID: <C2354214-A937-4FFE-BCD4-0562F896482D@microbio.umass.edu>

Hello,

I am trying to install EMBOSS under cygwin for teaching purposes.

make crashes on ajfile because d_ino appears to be missing in the
current version of cygwin.

Is there a workaround for this?

Thanks, Jeff

-------------------------------
Jeffrey L. Blanchard
Assistant Professor
Department of Microbiology
University of Massachusetts
Amherst, MA 01003
Office and Lab: Morrill I N330
Tel: 413-577-2130
Fax: 413-545-1578
http://www.bio.umass.edu/micro/blanchard/Lab_About.html


From ajb at ebi.ac.uk  Thu Mar  9 19:22:45 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 10 Mar 2006 00:22:45 -0000 (GMT)
Subject: [EMBOSS] d_ino
In-Reply-To: <C2354214-A937-4FFE-BCD4-0562F896482D@microbio.umass.edu>
References: <C2354214-A937-4FFE-BCD4-0562F896482D@microbio.umass.edu>
Message-ID: <41243.81.96.70.96.1141950165.squirrel@webmail.ebi.ac.uk>

Hi,

Yes, indeed there is a fix. Look in the directory:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/

The README file there will usually tell you what each of the files
fixes.

HTH

Alan Bleasby
EBI


> Hello,
>
> I am trying to install EMBOSS under cygwin for teaching purposes.
>
> make crashes on ajfile because d_ino appears to be missing in current
> version of cygwin.
>
> Is there a work around for this?
>
> Thanks, Jeff
>
> -------------------------------
> Jeffrey L. Blanchard
> Assistant Professor
> Department of Microbiology
> University of Massachusetts
> Amherst, MA 01003
> Office and Lab: Morrill I N330
> Tel: 413-577-2130
> Fax: 413-545-1578
> http://www.bio.umass.edu/micro/blanchard/Lab_About.html
>
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at emboss.open-bio.org
> http://newportal.open-bio.org/mailman/listinfo/emboss
>


From jison at ebi.ac.uk  Wed Mar 15 12:09:59 2006
From: jison at ebi.ac.uk (Jon Ison)
Date: Wed, 15 Mar 2006 17:09:59 -0000 (GMT)
Subject: [EMBOSS] EMBOSS Developers Course - reminder
Message-ID: <39760.172.31.70.94.1142442599.squirrel@webmail.ebi.ac.uk>

Hi

There are still some places left on this course.
Get in touch if you'd like to attend.

Cheers

Jon


BSDC 2006
Bioinformatics Software Development Course
April 18-20 2006

Following from the highly successful BSDC 2003/2004 courses, a new
series of courses on 'Bioinformatics Software Development' using
EMBOSS will be held in the training room at The Wellcome Trust Conference
Centre on April 18-20, 2006.

The course will give a good introduction to programming in EMBOSS.
By the end of the course you will be experienced in all the steps in
writing a basic bioinformatics application using the EMBOSS
programming libraries.

The course would suit competent programmers, probably with at least a
couple of years of experience. A reasonable working knowledge of C is
required to get the most out of the course; familiarity with pointers
is helpful but not essential. That said, all are welcome regardless
of background or experience.

Places are limited so please email Liz Ford (ford at ebi.ac.uk) to register
as soon as possible.

We do not make a profit on the course but must charge £125 / person
(for the 3-days) to recover some of our costs.
We are unable to take credit card payments. The preferred method
of payment is by cheque made payable to 'Industry Workshops'.
If you wish to pay in cash or by bank transfer please contact Liz Ford
(ford at ebi.ac.uk)

To read more about the course see
http://emboss.sourceforge.net/developers/developers_course/
To read more about EMBOSS see
http://emboss.sourceforge.net/

To register:
email Liz Ford (ford at ebi.ac.uk)
with your full name, address, phone number
You will then receive an email back confirming your registration or not.
Please note, as mentioned before, places are limited so not all registrations
will be successful.

For further information
email Jon Ison (jison at ebi.ac.uk)


From pmr at ebi.ac.uk  Mon Mar 27 12:50:09 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Mon, 27 Mar 2006 18:50:09 +0100 (BST)
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <00b501c651be$95b37500$e6028a0a@GOLHARMOBILE1>
References: <00b501c651be$95b37500$e6028a0a@GOLHARMOBILE1>
Message-ID: <2253.86.132.217.176.1143481809.squirrel@webmail.ebi.ac.uk>

Ryan Golhar wrote:

> I have a BLAST alignment: query sequence and database sequence.
>
> The alignment is only showing the HSP from the blast output as expected,
> however I want to build an alignment of the entire database sequence
> against my query sequence.
>
> I tried using needle from EMBOSS, however its aligning the sequences
> completely different than BLAST does.  What I'd really like is a way to
> anchor the alignment based on the BLAST HSP.  Does anyone know how to do
> this, or what tool(s) will allow me to do this?

You are quite right that EMBOSS may align the sequences completely
differently - unless the HSPs are very significant and cover most of the
sequence this will be true of any attempt to simply realign. There has to
be some way to pass on the HSPs as fixed positions, as in the BioPerl
solution.

However, it could make a nice EMBOSS application - the only question would
be how you would like to specify the HSPs. Perhaps we could read BLAST
output (in some specified format), or perhaps some other way to give the
input alignments.

We do have at least one EMBOSS application that does something similar
(finds all long perfect matches and interpolates) - we just need to reuse
the interpolation code which is basically doing a global alignment of the
bits in between. That also tackles the problem of choosing which
non-compatible initial matches to use.
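
The idea of keeping an HSP fixed and globally aligning "the bits in between" can be sketched in a few lines of Python. This is only an illustration, not the EMBOSS interpolation code mentioned above: it handles a single HSP, uses a plain Needleman-Wunsch with an assumed match/mismatch/linear-gap scoring, and the names needleman_wunsch and anchored_alignment and the 0-based start offsets are all hypothetical choices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b (linear gap cost); return the gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def anchored_alignment(query, subject, hsp_query, hsp_subject, q_start, s_start):
    """Keep one HSP fixed and globally align the flanks on either side.

    hsp_query / hsp_subject are the gapped HSP strings as reported by
    BLAST; q_start / s_start are 0-based offsets of the HSP in each
    full sequence (an assumption of this sketch).
    """
    q_end = q_start + len(hsp_query.replace("-", ""))
    s_end = s_start + len(hsp_subject.replace("-", ""))
    left_q, left_s = needleman_wunsch(query[:q_start], subject[:s_start])
    right_q, right_s = needleman_wunsch(query[q_end:], subject[s_end:])
    return left_q + hsp_query + right_q, left_s + hsp_subject + right_s


aligned_q, aligned_s = anchored_alignment(
    "TTACGTACGGA", "ACGTACGG", "ACGTACGG", "ACGTACGG", 2, 0)
print(aligned_q)
print(aligned_s)
```

With several HSPs the same approach would need a step that picks a mutually compatible, collinear subset first, which is the problem of choosing between non-compatible initial matches mentioned above.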

Hope that helps,

Peter


From golharam at umdnj.edu  Mon Mar 27 11:50:42 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 27 Mar 2006 11:50:42 -0500
Subject: [EMBOSS] Building an alignment from BLAST hsp
Message-ID: <00b501c651be$95b37500$e6028a0a@GOLHARMOBILE1>

I have a BLAST alignment: query sequence and database sequence.  

The alignment is only showing the HSP from the blast output as expected,
however I want to build an alignment of the entire database sequence
against my query sequence.  

I tried using needle from EMBOSS, however its aligning the sequences
completely different than BLAST does.  What I'd really like is a way to
anchor the alignment based on the BLAST HSP.  Does anyone know how to do
this, or what tool(s) will allow me to do this?

Ryan


From golharam at umdnj.edu  Mon Mar 27 13:03:39 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 27 Mar 2006 13:03:39 -0500
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <2253.86.132.217.176.1143481809.squirrel@webmail.ebi.ac.uk>
Message-ID: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1>

Hi Peter,

> You are quite right that EMBOSS may align the sequences completely 
> differently - unless the HSPs are very significant and cover most 
> of the sequence this will be true of any attempt to simply realign. 
> There has to be some way to pass on the HSPs as fixed positions, 
> as in the BioPerl solution.

I looked at a bioperl method, but can't seem to find something that will
accomplish this.  

> However, it could make a nice EMBOSS application - the only question 
> would be how you would like to specify the HSPs. Perhaps we could read

> BLAST output (in some specified format), or perhaps some other way to 
> give the input alignments.

Yes, I agree.  I suppose the best way would be to specify the two
sequences and the blast output.  The application could then construct an
alignment based on a particular HSP (probably the first one, or whatever
the user specifies).

Ryan


From letondal at pasteur.fr  Tue Mar 28 02:25:07 2006
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 28 Mar 2006 09:25:07 +0200
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1>
References: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1>
Message-ID: <4b91818a096ba42d8d53279a7f63e6ea@pasteur.fr>


On Mar 27, 2006, at 8:03 PM, Ryan Golhar wrote:

> Hi Peter,
>
>> You are quite right that EMBOSS may align the sequences completely
>> differently - unless the HSPs are very significant and cover most
>> of the sequence this will be true of any attempt to simply realign.
>> There has to be some way to pass on the HSPs as fixed positions,
>> as in the BioPerl solution.
>
> I looked at a bioperl method, but can't seem to find something that 
> will
> accomplish this.
>
>> However, it could make a nice EMBOSS application - the only question
>> would be how you would like to specify the HSPs. Perhaps we could read
>
>> BLAST output (in some specified format), or perhaps some other way to
>> give the input alignments.
>
> Yes, I agree.  I suppose the best way would be to specify the two
> sequences and the blast output.  The application could then construct 
> an
> alignment based on a particular HSP (probably the first one, or 
> whatever
> the user specifies).
>

Have you tried this:
http://bioweb.pasteur.fr/seqanal/interfaces/seqsblast.html

It is based on bioperl. check "Get HSP" option (you can even extend it).

Best,

--
Catherine Letondal -- Institut Pasteur -- Computing Center


From cquijano at iib.uam.es  Tue Mar 28 04:49:01 2006
From: cquijano at iib.uam.es (Carlos Quijano)
Date: Tue, 28 Mar 2006 11:49:01 +0200
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <4b91818a096ba42d8d53279a7f63e6ea@pasteur.fr>
References: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1> 
	<4b91818a096ba42d8d53279a7f63e6ea@pasteur.fr>
Message-ID: <1143539342.8611.45.camel@localhost.localdomain>

Hi all,

I didn't read this before, sorry for the lapse. And apologies if what
I tell you is not exactly what you needed, Ryan.

What you are looking for is just _MVIEW_, an old but nice application.
Use scholar.google.com / pubmed to find more information about it; I
remember that there are web servers running CGIs somewhere. It is
possible that during these last years somebody has published a
better tool or a new mview version... Look for it.

MVIEW is a parser for your blast output.
MVIEW works for your problem because you want to align only one sequence
(as a template) to an entire database (I suppose with some cutoff on
the e-value or p-value, at least the default, that is, ten), or against
a set of sequences, or against only one more sequence (a two-sequence
alignment).

I continue with some considerations about aligning HSPs from Blast the
way you intend and mview does... they are important and take only a
minute to read:
Remember, what you get is what you wanted, but not a real thing (this is
something very typical in bioinformatics - and all science - hahaha).
You don't get a real multiple alignment; you get an artifact: the blast
HSPs of every gene in the database piled under a template gene
(your sequence).
All right then. You don't have by any means an alignment, not even an
alignment of the genes using HSPs, because there can be HSPs that are
alignable between sequences in the database but are hidden from the
alignment when the sequences are piled under your sequence: your
sequence lacks these HSPs and they are _ignored_.
Why is this so important?
What I actually mean is that if you use these "sequences piled under a
template" as a multiple alignment, you will be lying about the
underlying (that is, not lying ;-) topology of the gene network that
arises from your database plus your sequence when correctly aligned,
that is, all against all... etc, etc, etc.
Well, that is the mathematically exhaustive, optimal way... normally we
use heuristics again, and again, and again... But "all against all" is
the key concept involved in the multiple alignment problem. It is very
important to be aware of these things.
needle is the optimal way <-> Blast is the heuristic
Clustal is also a very, very heuristic solution to the massive problem
of multiple alignment. Personally I prefer to use muscle, which uses a
better mathematical model and is (right now) the quickest aligner in
most cases.

I am sure that most of you know this already.
I hope it is useful for newbies and others, so forgive me for the
boring, tedious discourse...


CQ

On Tue, 28-03-2006 at 09:25 +0200, Catherine Letondal wrote:

> On Mar 27, 2006, at 8:03 PM, Ryan Golhar wrote:
> 
> > Hi Peter,
> >
> >> You are quite right that EMBOSS may align the sequences completely
> >> differently - unless the HSPs are very significant and cover most
> >> of the sequence this will be true of any attempt to simply realign.
> >> There has to be some way to pass on the HSPs as fixed positions,
> >> as in the BioPerl solution.
> >
> > I looked at a bioperl method, but can't seem to find something that 
> > will
> > accomplish this.
> >
> >> However, it could make a nice EMBOSS application - the only question
> >> would be how you would like to specify the HSPs. Perhaps we could read
> >
> >> BLAST output (in some specified format), or perhaps some other way to
> >> give the input alignments.
> >
> > Yes, I agree.  I suppose the best way would be to specify the two
> > sequences and the blast output.  The application could then construct 
> > an
> > alignment based on a particular HSP (probably the first one, or 
> > whatever
> > the user specifies).
> >
> 
> Have you tried this:
> http://bioweb.pasteur.fr/seqanal/interfaces/seqsblast.html
> 
> It is based on bioperl. check "Get HSP" option (you can even extend it).
> 
> Best,
> 
> --
> Catherine Letondal -- Institut Pasteur -- Computing Center
> 
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss

Carlos Quijano
http://www2.iib.uam.es/cquijano
Evolution and Development laboratory
Regulation of Gene Expression Department
Institute for Biomedical Research
http://www.iib.uam.es


From kvddrift at earthlink.net  Wed Mar 29 19:36:23 2006
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Wed, 29 Mar 2006 19:36:23 -0500
Subject: [EMBOSS] crash on intel-Mac
Message-ID: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>

Hi,

I got a report from a user (of the fink package of emboss) that the  
following crashes occur on his Mac with an intel processor:

% wossname
Error: Failed to compile regular expression '^(.*/)[^/]+/?$' at position 716: range out of order in character class
Bus error

All other programs just give a bus error.


I don't get these errors on a Mac with a PowerPC processor.

This is emboss 3.0.0.


- Koen.


From areagp61 at yahoo.it  Thu Mar 30 03:31:42 2006
From: areagp61 at yahoo.it (Graziano P.)
Date: Thu, 30 Mar 2006 10:31:42 +0200 (CEST)
Subject: [EMBOSS] dbifasta index file format
Message-ID: <20060330083142.4237.qmail@web26207.mail.ukl.yahoo.com>

Hello EMBOSS users,
I have some databases in fasta format (ncbi | format)
and I want to index them using dbifasta; then I want
to access the index files using a program that will be
developed by a computer scientist in my group.
I need to index the databases by accession number,
GI number and description. I have read in the dbifasta
help info about the structure of the index files when
the databases are indexed by accession number, but I
have not found info about the structure of the index
files when the databases are indexed by description.
Does anyone know where I can find detailed information
about the structure of the index files?

Regards
Graziano




From ajb at ebi.ac.uk  Thu Mar 30 03:38:10 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 30 Mar 2006 09:38:10 +0100 (BST)
Subject: [EMBOSS] crash on intel-Mac
In-Reply-To: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
References: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
Message-ID: <37407.81.98.244.247.1143707890.squirrel@webmail.ebi.ac.uk>

Hi,

Thanks. We already have a report of this and are working on a
solution.

Alan


From gbottu at ben.vub.ac.be  Thu Mar 30 04:37:23 2006
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 30 Mar 2006 11:37:23 +0200
Subject: [EMBOSS] A note about fastA format(s) - Checked by AntiVir DEMO
	version -
Message-ID: <20060330093723.GA18690@bigben.ulb.ac.be>

	Dear friends,

We are using EMBOSS version 3.0. One of my colleagues tried to use a 
multiple sequence file in fastA format, where each comment line starts 
with a string containing multiple pipe signs. A USA of type
fasta::file:xx|yy|zz|uu|ss
did not work. After some trial and error I found that putting "pearson"
instead of "fasta" helped. This is strange, since according to the
on-line manual at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
"fasta" and "pearson" are synonyms. Here it seems that "fasta" is instead 
treated the same as "ncbi". Comments?

	Guy Bottu,
	BEN
 

From enrique.deandres at pcm.uam.es  Thu Mar 30 10:46:30 2006
From: enrique.deandres at pcm.uam.es (Enrique de Andres Saiz)
Date: Thu, 30 Mar 2006 17:46:30 +0200
Subject: [EMBOSS] Problem indexing PDB fasta file
Message-ID: <442BFD56.9010908@pcm.uam.es>

Hello,

I'm trying to index the fasta file of the PDB database with the dbifasta 
command and I get a lot of warnings such as:

Warning: Duplicate ID skipped: '1FNT_A' All hits will point to first ID found

Looking at the PDB fasta file, I see that, for the warning above, there
is one entry whose ID is '1FNT_A' and another one whose ID is '1FNT_a'.
This makes me think that EMBOSS is case-insensitive. Is this true? Is
there any way to distinguish between the two IDs?
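
To see how many entries are affected before indexing, the colliding IDs can be listed with a short script. This is a minimal Python sketch under stated assumptions: the ID is taken to be the first whitespace-delimited token after '>' on each header line, and the function name case_insensitive_collisions is made up for illustration.

```python
from collections import defaultdict


def case_insensitive_collisions(fasta_lines):
    """Group FASTA IDs that differ only by letter case.

    Returns a dict mapping the lower-cased ID to the distinct
    spellings found, for IDs with more than one spelling.
    """
    spellings = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # header with no ID at all
        seq_id = fields[0]
        if seq_id not in spellings[seq_id.lower()]:
            spellings[seq_id.lower()].append(seq_id)
    return {key: ids for key, ids in spellings.items() if len(ids) > 1}


# Tiny made-up example; real use would pass an open file object instead.
example = [">1FNT_A first chain", "MSDQ", ">1FNT_a second chain", "MSDH", ">2ABC_B", "MKV"]
for key, ids in case_insensitive_collisions(example).items():
    print(key, ids)
```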

Thanks in advance,

Enrique.


From pmr at ebi.ac.uk  Thu Mar 30 16:47:19 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Thu, 30 Mar 2006 22:47:19 +0100 (BST)
Subject: [EMBOSS] A note about fastA format(s) - Checked by AntiVir DEMO
 version -
In-Reply-To: <20060330093723.GA18690@bigben.ulb.ac.be>
References: <20060330093723.GA18690@bigben.ulb.ac.be>
Message-ID: <50335.68.153.173.207.1143755239.squirrel@webmail.ebi.ac.uk>

Dear Guy,

> We are using EMBOSS version 3.0. One of my colleagues tried to use a
> multiple sequence file in fastA format, where each comment line starts
> with a string containing multiple pipe signs. An USA of type
> fasta::file:xx|yy|zz|uu|ss
> did not work. After some trial I found that putting "pearson" instead of
> "fasta" helped. This is strange, since according to the on-line manual at
> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
> "fasta" and "pearson" are synonyms. Here it seems that "fasta" is instead
> treated the same as "ncbi". Comments ?

Yes, that is indeed true. We had to make changes to support various NCBI
formats, and made FASTA and NCBI the same. We kept "pearson" as the
original plain fasta format.

We will update the documentation - it will take a little time to check for
any other changes to the formats.

regards,

Peter


From ajb at ebi.ac.uk  Fri Mar 31 07:12:53 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 31 Mar 2006 13:12:53 +0100 (BST)
Subject: [EMBOSS] crash on intel-Mac
In-Reply-To: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
References: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
Message-ID: <51078.81.98.244.247.1143807173.squirrel@webmail.ebi.ac.uk>

This should now be fixed as long as you apply all the fixes to EMBOSS-3.0.0
from the directory:

    ftp://emboss.open-bio.org/pub/EMBOSS/fixes/

The latest file there is a new 'configure'; however, if you've not
applied the previous patches in the above directory as well, you'll get
a compilation failure. Look at the README for details of what the
patches fix.

Thanks to Bill van Etten for previous emails on this.

Changes to the CVS developers version will follow.

Alan


From dksamuel at gmail.com  Fri Mar 31 23:12:14 2006
From: dksamuel at gmail.com (Duleep Samuel)
Date: Sat, 1 Apr 2006 09:42:14 +0530
Subject: [EMBOSS] Fwd: EMBOSS for Windows without Cygwin
In-Reply-To: <442CCD71.60202@gmail.com>
References: <442CCD71.60202@gmail.com>
Message-ID: <a0bf33d50603312012yd77e73ex9e5f88b3acc10e97@mail.gmail.com>

Is the latest EMBOSS version, 3.0.0, available anywhere as a precompiled
binary for Windows XP? I have tried compiling using cygwin and it
crashed. I loaded EMBOSS for Windows, which is a port of version 2.10.0,
loaded the Staden Package and made Spin aware of EMBOSS, and am working,
but feel bad that I am _One_ whole release behind. If anyone has a
compiled binary, I can download it for testing and report back on
usability.
regards, Samuel, Virologist, India


From aengus.stewart at cancer.org.uk  Thu Mar  2 14:56:25 2006
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Thu, 02 Mar 2006 14:56:25 +0000
Subject: [EMBOSS] DB - finding out how many sequences
Message-ID: <44070799.6090804@cancer.org.uk>


Hi,

Does any of the EMBOSS apps output the number of sequences that it has searched?

I am after this figure as I have a data library issue.

Some sequences are "not found" by EMBOSS even though I know they are in the original flat files.

I am trying to figure out if this is

configure problem
data problem
indexing problem.

The indexing with dbiflat doesnt complain but I would like to be able to check my input number of sequences with what EMBOSS thinks was output.


Cheers
Aengus


-- 
-----------------------------------------------------------------------
Aengus Stewart
Group Leader
Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential.  The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.


From jison at ebi.ac.uk  Thu Mar  2 16:35:35 2006
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 2 Mar 2006 16:35:35 -0000 (GMT)
Subject: [EMBOSS] DB - finding out how many sequences
In-Reply-To: <44070799.6090804@cancer.org.uk>
References: <44070799.6090804@cancer.org.uk>
Message-ID: <43713.172.31.100.168.1141317335.squirrel@webmail.ebi.ac.uk>

Ay up Aengus

Not so far as I'm aware although you could get that number
indirectly by using infoseq.

You could try using dbxflat too ... which does generate some
stats on the input data - I don't know whether the stats include
the number of sequences that were indexed, but it's worth a look.
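
For the input side of the comparison, the entries in the flat file can be counted without EMBOSS at all. The following is only a minimal sketch, assuming the library is an EMBL/Swiss-Prot style flat file in which every entry starts with an "ID" line and ends with a "//" line; the function name and the two-entry example data are made up for illustration.

```python
import io


def count_flatfile_entries(handle):
    """Count entry starts ('ID' lines) and entry ends ('//' lines)."""
    starts = 0
    ends = 0
    for line in handle:
        if line.startswith("ID "):
            starts += 1
        elif line.rstrip() == "//":
            ends += 1
    return starts, ends


# Tiny made-up example with two entries (not real database records).
example = io.StringIO(
    "ID   FOO_HUMAN\nAC   P00001;\nSQ   SEQUENCE\n     MATSCG\n//\n"
    "ID   BAR_HUMAN\nAC   P00002;\nSQ   SEQUENCE\n     MLTSCG\n//\n"
)
starts, ends = count_flatfile_entries(example)
print(starts, ends)
```

If the two numbers differ, an entry is probably truncated or malformed, which would point to a data problem rather than an indexing problem.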

Cheers

Jon


From pmr at ebi.ac.uk  Thu Mar  2 17:39:07 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 02 Mar 2006 17:39:07 +0000
Subject: [EMBOSS] DB - finding out how many sequences
In-Reply-To: <44070799.6090804@cancer.org.uk>
References: <44070799.6090804@cancer.org.uk>
Message-ID: <44072DBB.1060506@ebi.ac.uk>

Hi Aengus,

> Does any of the EMBOSS apps output the number of sequences that it has searched?
> 
> I am after this figure as I have a data library issue.
> 
> Some sequences are "not found" by EMBOSS even though I know they are in the original flat files.

What is your database definition?

regards,

Peter


From joanne at bioinformatics.ubc.ca  Mon Mar  6 22:51:00 2006
From: joanne at bioinformatics.ubc.ca (Joanne Fox)
Date: Mon, 06 Mar 2006 14:51:00 -0800
Subject: [EMBOSS] Warning: Cannot open division file '<null>' for database
	'swissprot'
Message-ID: <440CBCD4.8030605@bioinformatics.ubc.ca>

Hello EMBOSS community,

I used dbiflat to index the latest flatfile distribution of swissprot 
(uniprot_sprot.dat).  Now I am trying to use this database with the 
EMBOSS patmatdb program and I'm encountering an error that reads, 
"Warning: Cannot open division file '<null>' for database 'swissprot'".

I searched the mailing list archives and I see that others have had this same 
problem, which boiled down to permissions and/or path problems.  However, I 
still can't figure out what's going wrong on my system.  I've put more 
detailed information below.  I'm new to the world of configuring EMBOSS 
so if anyone has any ideas about what might be going wrong, I'd really 
appreciate the advice.

Thanks,
Joanne.
-- 
| Joanne Fox
| http://bioinformatics.ubc.ca/people/joanne


~> showdb
Displays information on the currently available databases
# Name        Type ID  Qry All Comment
# ====        ==== ==  === === =======
swissprot     P    OK  OK  OK  Swissprot Release 7.1, 2/21/2006

~> patmatdb
Search a protein sequence with a motif
Input sequence(s): swissprot
Warning: Cannot open division file '<null>' for database 'swissprot'
Error: Unable to read sequence 'swissprot'
Input sequence(s): swissprot:*
Warning: Cannot open division file '<null>' for database 'swissprot'
Error: Unable to read sequence 'swissprot:*'
Died: patmatdb terminated: Bad value for '-sequence' and no more retries

contents of .embossrc file:
set emboss_logfile /usr/local/software/bioinformatics/emboss/log/emboss.log
set emboss_database_dir /raid1/bioinformatics/data
DB swissprot [
    type: P
    method: emblcd
    format: swissprot
    dir: \$emboss_database_dir/swissprot/swissprot_V7_1
    file: "*.dat"
    release: "7.1"
    comment: "Swissprot Release 7.1, 2/21/2006"
]

contents of the /raid1/bioinformatics/data/swissprot/swissprot_V7_1/ 
directory:
-rw-r--r--    1 bin      bin       1165524 Mar  6 12:37 acnum.hit
-rw-r--r--    1 bin      bin       3899118 Mar  6 12:37 acnum.trg
-rw-r--r--    1 bin      bin           322 Mar  6 12:37 division.lkp
-rw-r--r--    1 bin      bin       4368405 Mar  6 12:37 entrynam.idx
-rw-r--r--    1 bin      bin      802445434 Mar  6 12:23 uniprot_sprot.dat


From yezhiqiang at gmail.com  Mon Mar  6 21:29:49 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Tue, 7 Mar 2006 05:29:49 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein sequence?
Message-ID: <34198fe40603061329u85c9f95p@mail.gmail.com>

Dear all,

      Does EMBOSS have a handy way to mutate a protein sequence in a
specified way?
For example, I have a sequence in foo.fasta:

>foo
MATSCGLLKIIQRE

It has a mutant called 'A2L'. Is there any way to perform this substitution
and produce the output below (ideally with an option to check that the
sequence in foo.fasta has 'A' at position 2)?
>foo A2L
MLTSCGLLKIIQRE

My way: use extractseq to extract two files, one with the residues before
position 2 and the other with the residues after position 2. Then create a
FASTA file containing 'L'. After that, I use union to join these three
sequence files into one.

Alternatively, write a Perl script that does this by changing a substring
of the sequence string.

It would be great if EMBOSS could provide a 'mutate' program!

Thank you :)

Best regards!
--
Zhiqiang Ye


From Marc.Logghe at DEVGEN.com  Tue Mar  7 07:59:27 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Tue, 7 Mar 2006 08:59:27 +0100
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746B99@ANTARESIA.be.devgen.com>

Can msbar do something for you ? Msbar = "Mutate sequence beyond all
recognition"

Cheers,
Marc


From David.Bauer at SCHERING.DE  Tue Mar  7 06:40:24 2006
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 7 Mar 2006 07:40:24 +0100
Subject: [EMBOSS] Reply: Does emboss have a handy way for mutate a protein
 sequence?
In-Reply-To: <34198fe40603061329u85c9f95p@mail.gmail.com>
Message-ID: <OFDFF9CC1E.C290B537-ONC125712A.00246F55-C125712A.0024A951@schering.de>


What about this solution:

cutseq foo.fasta -from 2 -to 2 | pasteseq -filter -pos 1 -bs asis:'L' |
descseq -filter -append -desc "A2L"

>foo A2L
MLTSCGLLKIIQRE

Cheers,
David.


From yezhiqiang at gmail.com  Tue Mar  7 13:22:57 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Tue, 7 Mar 2006 21:22:57 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746B99@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA6746B99@ANTARESIA.be.devgen.com>
Message-ID: <34198fe40603070522r5d0920f9q@mail.gmail.com>

2006/3/7, Marc Logghe <Marc.Logghe at devgen.com>:
> Can msbar do something for you ? Msbar = "Mutate sequence beyond all
> recognition"

Thank you. I have checked msbar, it cannot do what I need.


--
Zhiqiang Ye


From pmr at ebi.ac.uk  Tue Mar  7 13:55:39 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 07 Mar 2006 13:55:39 +0000
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <34198fe40603061329u85c9f95p@mail.gmail.com>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
Message-ID: <440D90DB.6040605@ebi.ac.uk>

Zhiqiang Ye wrote:

> Dear all,
> 
>       Does emboss have a handy way for mutate a protein sequence by
> the specified way?
> For example, I have a sequence foo.fasta
> 
> 
>>foo
> 
> MATSCGLLKIIQRE
> 
> It has a mutant called  'A2L'. Is there any way to do this operation
> to output(with an option to check the foo.fasta has 'A' at position
> 2):
> 
>>foo A2L
> 
> MLTSCGLLKIIQRE

EMBOSS has several programs to change sequences. None does exactly what you ask.

You could look at:

biosed (does what you ask for longer replacements, but will change all 'A's to 
'L's.)

We could extend biosed to specify the position of the pattern ... is that what 
you need?

regards,

Peter


From gbottu at ben.vub.ac.be  Tue Mar  7 15:38:08 2006
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Tue, 7 Mar 2006 16:38:08 +0100
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence
In-Reply-To: <34198fe40603061329u85c9f95p@mail.gmail.com>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
Message-ID: <20060307153808.GA15947@bigben.ulb.ac.be>

Your solution works, as does the one proposed by David Bauer. Both are, 
however, rather tedious. It is much easier with an interactive sequence 
editor. There is MSE, not a standard EMBOSS program but distributed as 
Embassadir. It runs in a VT100 terminal; it is a little bit 
intimidating for the novice user, but with some practice you can learn to 
use it. At the BEN site we have, besides MSE, also installed SeaView, a 
graphical editor (with versions for Windows, Macintosh and X-Window). 
These editors are of course only usable if you work locally on your own 
computer or in a terminal session on a remote computer. They will not work 
if you are using a Web interface for EMBOSS ... although some Web 
interfaces might have an applet-based editor that allows you to save the 
modified sequence back on the server (is there one in Jemboss?).

	Hope this helps,
	Guy Bottu,
	Belgian EMBnet Node


From yezhiqiang at gmail.com  Tue Mar  7 13:25:02 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Tue, 7 Mar 2006 21:25:02 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <OFDFF9CC1E.C290B537-ONC125712A.00246F55-C125712A.0024A951@schering.de>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
	<OFDFF9CC1E.C290B537-ONC125712A.00246F55-C125712A.0024A951@schering.de>
Message-ID: <34198fe40603070525w1ffa7155n@mail.gmail.com>

2006/3/7, David.Bauer at schering.de <David.Bauer at schering.de>:
> What about this solution:
>
> cutseq foo.fasta -from 2 -to 2 | pasteseq -filter -pos 1 -bs asis:'L' |
> descseq -filter -append -desc "A2L"

Thanks a lot.  It works very well!

Best

--
Zhiqiang Ye


From yezhiqiang at gmail.com  Tue Mar  7 16:33:40 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Wed, 8 Mar 2006 00:33:40 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence?
In-Reply-To: <440D90DB.6040605@ebi.ac.uk>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
	<440D90DB.6040605@ebi.ac.uk>
Message-ID: <34198fe40603070833p627159b0i@mail.gmail.com>

2006/3/7, Peter Rice <pmr at ebi.ac.uk>:
>
> EMBOSS has several programs to change sequences. None does exactly what you ask.
>
> You could look at:
>
> biosed (does what you ask for longer replacements, but will change all 'A's to
> 'L's.)

 Yeah, it will change all 'A's to 'L's...

> We could extend biosed to specify the position of the pattern ... is that what
> you need?
>
 Yes! If biosed can be extended to do this, it will be better :)

Best Regards!

--
Zhiqiang Ye


From yezhiqiang at gmail.com  Tue Mar  7 16:42:00 2006
From: yezhiqiang at gmail.com (Zhiqiang Ye)
Date: Wed, 8 Mar 2006 00:42:00 +0800
Subject: [EMBOSS] Does emboss have a handy way for mutate a protein
	sequence
In-Reply-To: <20060307153808.GA15947@bigben.ulb.ac.be>
References: <34198fe40603061329u85c9f95p@mail.gmail.com>
	<20060307153808.GA15947@bigben.ulb.ac.be>
Message-ID: <34198fe40603070842i15a34763s@mail.gmail.com>

hi, Guy Bottu

    Thank you. But I have to do a batch of these substitutions, so a
command-line solution will be better.  I wrote an ugly shell script to
do this, based on David Bauer's suggestion.

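#!/bin/bash
# Usage: mutate.sh <fasta file> <mutation, e.g. A2L>
# bash is needed for the ${var:offset:length} substring expansions below.

mutation=$2
WT=${mutation:0:1}                  # wild-type residue (parsed, not checked)
POS=${mutation:1:${#mutation}-2}    # 1-based position of the residue
MT=${mutation: -1}                  # mutant residue
POS2=$((POS - 1))                   # pasteseq inserts after this position

# Cut out the old residue, paste in the new one, then tag the description.
cutseq -filter -from "$POS" -to "$POS" < "$1" |
    pasteseq -filter -pos "$POS2" -bs "asis:$MT" |
    descseq -filter -append -desc " (mutant: $mutation )"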


With this script mutate.sh in my ~/bin, I can type this:
mutate.sh foo.fasta A2L
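
The script above parses the wild-type residue but never checks it against the sequence. As a rough sketch of that check, here is the same substitution on a plain string, outside EMBOSS. It assumes a mutation code of the form one letter, a 1-based position, one letter; the function name apply_mutation is made up for illustration.

```python
import re


def apply_mutation(sequence, mutation):
    """Apply a point mutation such as 'A2L' after checking the wild type."""
    match = re.fullmatch(r"([A-Za-z])(\d+)([A-Za-z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild_type = match.group(1)
    position = int(match.group(2))  # 1-based, as in 'A2L'
    mutant = match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    found = sequence[position - 1]
    if found.upper() != wild_type.upper():
        raise ValueError(
            f"expected {wild_type} at position {position}, found {found}"
        )
    # Everything before the position, the new residue, everything after it.
    return sequence[:position - 1] + mutant + sequence[position:]


print(apply_mutation("MATSCGLLKIIQRE", "A2L"))  # MLTSCGLLKIIQRE
```

A mutation whose wild-type letter does not match the sequence, such as 'T2L' here, raises an error instead of silently changing the residue.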

Best

--
Zhiqiang Ye


From Marc.Logghe at DEVGEN.com  Wed Mar  8 09:00:14 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed, 8 Mar 2006 10:00:14 +0100
Subject: [EMBOSS] Oddcomp behaves oddly ...
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746BA7@ANTARESIA.be.devgen.com>

... Or rather, how should I use it properly ?

OK, suppose you run compseq to obtain the frequency of individual
residues:
compseq tsw:Q62671 -word 1
Apparently this example protein sequence is rather rich in leucine (106
L out of 889).

In order to detect this leucine bias, a little file was created
(leu.comp) that had the following content:
<file leu.comp>
Word size       1
Total count     0

# bias should be detected as 106 > 100
L       100
</file leu.comp>

Oddcomp was run like this:
oddcomp tsw:Q62671 -infile leu.comp -window 889

But the sequence is not reported.
When I change the L count to 10 in leu.comp, it does not work either.
Strangely enough, when the default window (30) is used, the sequence is
reported.
What is happening here?

Regards,
Marc


From d.gatherer at vir.gla.ac.uk  Wed Mar  8 09:30:13 2006
From: d.gatherer at vir.gla.ac.uk (Derek Gatherer)
Date: Wed, 08 Mar 2006 09:30:13 +0000
Subject: [EMBOSS] clustalw vs. emma
Message-ID: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>

Morning all

Is there some unusual default being passed to emma?  For instance, 
here's emma with a vanilla set of parameters on a fairly well 
conserved set of proteins (bdlf4.fa):

yoda:cluscheck 157 > emma bdlf4.fa -osformat2 msf -out2 bdlf4.emma -auto

  CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: AG876-BDLF4      225 aa
Sequence 2: B95-BDLF4        225 aa
Sequence 3: GD1-BDLF4        225 aa
Sequence 4: RLV-BDLF4        238 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  100
Sequences (1:3) Aligned. Score:  98
Sequences (1:4) Aligned. Score:  85
Sequences (2:3) Aligned. Score:  98
Sequences (2:4) Aligned. Score:  85
Sequences (3:4) Aligned. Score:  86
Guide tree        file created:   [00029986C]
Start of Multiple Alignment
There are 3 groups
Aligning...
Group 1: Sequences:   2      Score:3770
Group 2: Sequences:   3      Score:3741
Group 3: Sequences:   4      Score:3462
Alignment Score 8058
GCG-Alignment file created      [00029986B]

and now clustalw itself, not wrapped by emma, with the same input file

yoda:cluscheck 158 > clustalw bdlf4.fa

  CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence format is Pearson
Sequence 1: AG876-BDLF4      225 aa
Sequence 2: B95-BDLF4        225 aa
Sequence 3: GD1-BDLF4        225 aa
Sequence 4: RLV-BDLF4        238 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  100
Sequences (1:3) Aligned. Score:  98
Sequences (1:4) Aligned. Score:  88
Sequences (2:3) Aligned. Score:  98
Sequences (2:4) Aligned. Score:  88
Sequences (3:4) Aligned. Score:  88
Guide tree        file created:   [bdlf4.dnd]
Start of Multiple Alignment
There are 3 groups
Aligning...
Group 1: Sequences:   2      Score:4959
Group 2: Sequences:   3      Score:4928
Group 3: Sequences:   4      Score:4677
Alignment Score 8187
CLUSTAL-Alignment file created  [bdlf4.aln]

Why is the scoring subtly different? And see what it does to the 
N-terminus of the alignment....

First with emma:

            1                                               50
AG876-BDLF4 ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
B95-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
GD1-BDLF4   ~~~~~~~~~~~~~MSDQGRLSLPRGEGGTDEPNPRHLCSYSKLEFHLPLP
RLV-BDLF4   MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLP

now with clustalw:

AG876-BDLF4      MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
B95-BDLF4        MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
GD1-BDLF4        MSDQGRLS-------------LPRGEGGTDEPNPRHLCSYSKLEFHLPLPESMASVFACW
RLV-BDLF4        MSDHGRVSGRPRGAVRGRGASSPDGEGAPTGPNSRHLCSYSKLESHFPLPESMASVFACW
                  ***:**:*              * ***..  **.********** *:*************

Clustalw alone clearly gives the correct alignment whereas emma is 
wrong.  I thought that emma simply wrapped clustalw for automation, 
but it appears it is doing something else.  Out of a set of 80 
proteins I am trying to pipeline through alignment, emma gives a 
variant result for 7 of them.....

Any thoughts, as always, much appreciated

cheers
Derek


From Marc.Logghe at DEVGEN.com  Wed Mar  8 10:36:56 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed, 8 Mar 2006 11:36:56 +0100
Subject: [EMBOSS] Oddcomp behaves oddly ...
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746BAA@ANTARESIA.be.devgen.com>

> Basically what is happening is that there is a check for the 
> length of the sequence being shorter than the window.  It may 
> well be this that is giving the problem. 

This was a perfect diagnosis. It works fine when I make the window size
one less than the sequence length.
But I guess a window size equal to (or even larger than) the length of
the sequence should not be a problem for oddcomp? It is a way of
saying: don't bother with window sizes, just take the complete thing.
That could be a nice feature to have.
Thanks David,
Marc


From david at compbio.dundee.ac.uk  Wed Mar  8 10:26:23 2006
From: david at compbio.dundee.ac.uk (David Martin)
Date: Wed, 08 Mar 2006 10:26:23 +0000
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746BA7@ANTARESIA.be.devgen.com>
Message-ID: <C03461CF.1A4BD%david@compbio.dundee.ac.uk>

On 8/3/06 9:00 am, "Marc Logghe" <Marc.Logghe at devgen.com> wrote:

> ... Or rather, how should I use it properly ?
> 
> OK, suppose your run compseq to obtain the frequency for individual
> residues:
> compseq tsw:Q62671 -word 1
> Apparently this example protein sequence is rather rich in leucine (106
> L out of 889).
> 
> In order to detect this leucine bias, a little file was created
> (leu.comp) that had the following content:
> <file leu.comp>
> Word size       1
> Total count     0
> 
> # bias should be detected as 106 > 100
> L       100
> </file leu.comp>
> 
> Oddcomp was run like this:
> oddcomp tsw:Q62671 -infile leu.comp -window 889

Try window 888 (ie shorter than the length of the sequence). There are a
couple of minor bugs in the oddcomp code that I will forward to the team.

Basically what is happening is that there is a check for the length of the
sequence being shorter than the window.  It may well be this that is giving
the problem. 

It is a long time since I wrote this and C is not my usual language so
apologies if this is not a comprehensive answer.

..d

> 
> But the sequece is not reported.
> When I change the L count to 10 in leu.comp it does not work neither.
> Strangely enough, when the default window is taken (30) the sequence is
> reported.
> What is happening here ?
> 
> Regards,
> Marc


From jison at ebi.ac.uk  Wed Mar  8 11:05:33 2006
From: jison at ebi.ac.uk (Jon Ison)
Date: Wed, 8 Mar 2006 11:05:33 -0000 (GMT)
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746BAA@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA6746BAA@ANTARESIA.be.devgen.com>
Message-ID: <56257.84.92.187.247.1141815933.squirrel@webmail.ebi.ac.uk>

Hi Marc

What might be cleaner is if we modify the ACD file so that any window
size bigger than the sequence length is reprompted for.

Also, to add a qualifier to set the window to the sequence length, if
that'd help.

Cheers

Jon


>> Basically what is happening is that there is a check for the
>> length of the sequence being shorter than the window.  It may
>> well be this that is giving the problem.
>
> This was a perfect diagnosis. It works fine when I make the window size
> off one.
> But I guess it should not be a problem for oddcomp being the window size
> equal (or even larger) to the length of the sequence ? It is a way of
> saying: don't bother with window sizes, just take the complete thing.
> Could be a nice to have feature.
> Thanks David,
> Marc


From Marc.Logghe at DEVGEN.com  Wed Mar  8 11:36:06 2006
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed, 8 Mar 2006 12:36:06 +0100
Subject: [EMBOSS] Oddcomp behaves oddly ...
Message-ID: <0C528E3670D8CE4B8E013F6749231AA6746BAB@ANTARESIA.be.devgen.com>

Hi David,
I am afraid there are some remaining oddities with oddcomp.
Tried another protein, other residue.
<file compseq.data>
Word size       1
Total count     0

S       4
</file compseq.data>

First a set of sequences is generated (kind of mimicking sliding window)
of length 20:
splitter wormpep:ZK822.4 -size 20 -overlap 19 > split.fa

Second, oddcomp is run (with the window one less than the sequence length):
oddcomp split.fa -window 19 -infile compseq.data 
#
# Output from 'oddcomp'
#
# The Expected frequencies are taken from the file: compseq.data
#
#       Word size: 1
        ZK822.4_36-55
        ZK822.4_37-56
        ZK822.4_38-57
        ZK822.4_39-58
        ZK822.4_40-59
        ZK822.4_41-60

#       END     #

The first 20mer:
>ZK822.4_36-55
SAGSSGSNFLSGLQNSSFGQ

It is clear that there are 7 S residues in this stretch and we were
looking for 4 or more, so that makes sense.
However, when you run oddcomp again with an S count of 5 instead of 4, no
sequence is reported!
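
As a cross-check that does not depend on oddcomp, the S counts per window can be computed directly. This is only a sketch, assuming the criterion is "at least N occurrences within some window", as described above; the helper name window_counts is made up.

```python
def window_counts(sequence, residue, window):
    """Count `residue` in every full window of length `window`."""
    return [
        sequence[start:start + window].count(residue)
        for start in range(len(sequence) - window + 1)
    ]


twentymer = "SAGSSGSNFLSGLQNSSFGQ"  # ZK822.4_36-55
counts = window_counts(twentymer, "S", 19)
print(counts)  # [7, 6]
print(any(count >= 5 for count in counts))  # True
```

Both 19-residue windows hold more than 5 serines, so under that assumption a threshold of 5 should still report this sequence.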
Cheers,
Marc


From pmr at ebi.ac.uk  Wed Mar  8 12:09:25 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 08 Mar 2006 12:09:25 +0000
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <C03461CF.1A4BD%david@compbio.dundee.ac.uk>
References: <C03461CF.1A4BD%david@compbio.dundee.ac.uk>
Message-ID: <440EC975.6090907@ebi.ac.uk>

David Martin wrote:

> Basically what is happening is that there is a check for the length of the
> sequence being shorter than the window.  It may well be this that is giving
> the problem. 

Not that part - it accepts a window the same length as the sequence (oddcomp 
can read more than one sequence, and does have to skip those too short to fit 
a window).

A later loop does fail if the window size matches the sequence - I am testing 
allowing it to run just one more time :-)

> It is a long time since I wrote this and C is not my usual language so
> apologies if this is not a comprehensive answer.

Do you speak Fortran?

>>But the sequece is not reported.
>>When I change the L count to 10 in leu.comp it does not work neither.
>>Strangely enough, when the default window is taken (30) the sequence is
>>reported.

Same problem I believe - it is the window size matching sequence length that 
stops the last for loop from checking anything.
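
The kind of loop bound that behaves this way can be illustrated as follows. This is a hypothetical sketch of an off-by-one in window start positions, not the actual oddcomp source; both function names are made up.

```python
def window_starts_short(seq_len, window):
    """Start positions when the loop stops one iteration too early."""
    return list(range(seq_len - window))


def window_starts_full(seq_len, window):
    """Start positions when the last full window is included."""
    return list(range(seq_len - window + 1))


# Window equal to the sequence length: the short loop never runs.
print(window_starts_short(889, 889))  # []
print(window_starts_full(889, 889))   # [0]

# Default window of 30: the short loop only misses the final window.
print(len(window_starts_short(889, 30)))  # 859
print(len(window_starts_full(889, 30)))   # 860
```

With a bound like the first one, a window of 889 on an 889-residue sequence checks nothing, while the default window of 30 still checks almost every position, which matches what was reported.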

regards,

Peter


From pmr at ebi.ac.uk  Wed Mar  8 13:13:24 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 08 Mar 2006 13:13:24 +0000
Subject: [EMBOSS] Oddcomp behaves oddly ...
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA6746BAB@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA6746BAB@ANTARESIA.be.devgen.com>
Message-ID: <440ED874.7070100@ebi.ac.uk>

Marc Logghe wrote:

> Hi David,
> I am afraid there are some remaining oddities with oddcomp.
> The first 20mer:
> 
>>ZK822.4_36-55
> 
> SAGSSGSNFLSGLQNSSFGQ
> 
> It is clear that there are 7 S residues in this stretch and we were
> looking for 4 or more, so that makes sense.
> However, when you run oddseq again with S count of 5 instead of 4, no
> sequence is reported !

At least 2 bugs here. Firstly, with more than one sequence as input, some 
internal values were not fully reset. Also the word size is used (as 2) before 
it is set to 1.

For 8 Serines in this set I am still only getting one hit out of two. A little 
more investigation needed ... I am getting closer :-)

regards,

Peter


From ajb at ebi.ac.uk  Thu Mar  9 15:58:33 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 9 Mar 2006 15:58:33 -0000 (GMT)
Subject: [EMBOSS] clustalw vs. emma
In-Reply-To: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>
References: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>
Message-ID: <45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>

Hi Derek,

emma is indeed just a wrapper for clustalw. You can see what default
parameters it is using by specifying -debug on the command line
and then looking at the emma.dbg file. Search for a line
saying "Executing 'clustalw"

I suspect that the default gap extension penalty is rather high
in your case. If you use (e.g.) -gapext 0.2   then you'll get
something approaching the default clustalw behaviour. The defaults
for your sequences seem to be:

  -gapopen=10.000 -gapext=5.000 -gapdist=8
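
When checking many runs, the options can be pulled out of the debug text with a few lines of script. This is a rough sketch that assumes the debug file contains the command on a single line of the form Executing 'clustalw ...'; the function name and the shortened sample line (including the in.tmp file name) are made up for illustration.

```python
import re


def clustalw_options(debug_text):
    """Return the options from the first "Executing 'clustalw" line as a dict."""
    match = re.search(r"Executing 'clustalw\s+([^']*)'", debug_text)
    if match is None:
        return {}
    options = {}
    for token in match.group(1).split():
        # '-gapext=5.000' becomes {'gapext': '5.000'}; flags get ''.
        name, _, value = token.lstrip("-").partition("=")
        options[name] = value
    return options


# Shortened, made-up stand-in for a line from emma.dbg.
sample = (
    "Executing 'clustalw -infile=in.tmp -align "
    "-gapopen=10.000 -gapext=5.000 -gapdist=8'"
)
opts = clustalw_options(sample)
print(opts["gapopen"], opts["gapext"], opts["gapdist"])  # 10.000 5.000 8
```

Reading the real file would be a matter of passing its contents, for example open("emma.dbg").read(), to the same function.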


HTH

Alan


From d.gatherer at vir.gla.ac.uk  Thu Mar  9 16:18:55 2006
From: d.gatherer at vir.gla.ac.uk (Derek Gatherer)
Date: Thu, 09 Mar 2006 16:18:55 +0000
Subject: [EMBOSS] clustalw vs. emma
In-Reply-To: <45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>
References: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>
	<45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>
Message-ID: <6.2.3.4.1.20060309160317.02abb870@lenzie.gla.ac.uk>

Thanks Alan

That indeed is the cause of the problem:

Executing 'clustalw -infile=00052348A -outfile=00052348B -align
-type=protein -output=gcg -pwmatrix=blosum -pwgapopen=10.000
-pwgapext=0.100 -newtree=00052348C -matrix=blosum -gapopen=10.000
-gapext=5.000 -gapdist=8 -hgapresidues=GPSNDQEKR -maxdiv=30'

However, on attempting to manually specify it, I run into another one:

[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 
bdlf4.emma -auto -debug -pwgapextend 5
Died: Unknown qualifier -pwgapextend

In the docs http://emboss.sourceforge.net/apps/cvs/emma.html, there 
are quite a few optional parameters of this sort, some of which work 
and others don't, eg:

[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 
bdlf4.emma -auto -debug -gapextend 5
Died: Unknown qualifier -gapextend
[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 
bdlf4.emma -auto -debug -pwgapextend 5
Died: Unknown qualifier -pwgapextend
[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 
bdlf4.emma -auto -debug -gapopen 5
Died: Unknown qualifier -gapopen
[gath01d at gamma cluscheck]$ emma bdlf4.fa -osformat2 msf -out2 
bdlf4.emma -auto -debug -gapdist 5

CLUSTAL W (1.83) Multiple Sequence Alignments

so -gapdist works at least.
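
To compare what emma passes to clustalw against a plain clustalw run, the
"Executing" line from emma.dbg can be split into its individual options.
The following is a minimal sketch, assuming the command sits on one logical
line inside single quotes as in the debug output quoted in this thread; the
function name clustalw_options is made up for illustration and is not part
of EMBOSS.

```python
import re


def clustalw_options(debug_text):
    """Return the options of the first "Executing 'clustalw ...'" command.

    Hypothetical helper: assumes the whole command is enclosed in single
    quotes, as in the emma.dbg excerpt quoted in this thread.
    """
    match = re.search(r"Executing '(clustalw\b[^']*)'", debug_text)
    if match is None:
        return {}
    options = {}
    # Skip the program name, then split each "-name=value" token.
    for token in match.group(1).split()[1:]:
        name, _, value = token.lstrip("-").partition("=")
        options[name] = value  # bare flags such as -align get an empty value
    return options


sample = (
    "Executing 'clustalw -infile=00052348A -outfile=00052348B -align "
    "-type=protein -output=gcg -pwmatrix=blosum -pwgapopen=10.000 "
    "-pwgapext=0.100 -newtree=00052348C -matrix=blosum -gapopen=10.000 "
    "-gapext=5.000 -gapdist=8 -hgapresidues=GPSNDQEKR -maxdiv=30'"
)
opts = clustalw_options(sample)
print("gapext:", opts["gapext"], "pwgapext:", opts["pwgapext"])
```

Reading the values this way makes it easy to spot which penalties differ
from the ones a bare clustalw run would use.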

Cheers
Derek


At 15:58 09/03/2006, ajb at ebi.ac.uk wrote:
>Hi Derek,
>
>emma is indeed just a wrapper for clustalw. You can see what default
>parameters it is using by specifying -debug on the command line
>and then looking at the emma.dbg file. Search for a line
>saying "Executing 'clustalw"
>
>I suspect that the default gap extension penalty is rather high
>in your case. If you use (e.g.) -gapext 0.2   then you'll get
>something approaching the default clustalw behaviour. The defaults
>for your sequences seem to be:
>
>   -gapopen=10.000 -gapext=5.000 -gapdist=8
>
>
>HTH
>
>Alan


From pmr at ebi.ac.uk  Thu Mar  9 17:01:15 2006
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 09 Mar 2006 17:01:15 +0000
Subject: [EMBOSS] clustalw vs. emma
In-Reply-To: <6.2.3.4.1.20060309160317.02abb870@lenzie.gla.ac.uk>
References: <6.2.3.4.1.20060308092040.02aabc78@lenzie.gla.ac.uk>	<45378.81.96.70.96.1141919913.squirrel@webmail.ebi.ac.uk>
	<6.2.3.4.1.20060309160317.02abb870@lenzie.gla.ac.uk>
Message-ID: <44105F5B.3050200@ebi.ac.uk>

Derek Gatherer wrote:

> In the docs http://emboss.sourceforge.net/apps/cvs/emma.html, there 
> are quite a few optional parameters of this sort, some of which work 
> and others don't, eg:

Yup - we're putting that right (some people have noticed the application docs 
are moving around).

The emboss.sf.net website only documents things for the latest code in CVS. We 
are adding documentation for release 3.0.0 (that is why the new directories 
are appearing).

The release 3.0.0 documentation is installed on your system when you install 
3.0.0 - if you install to /usr/local/bin it will be in:

/usr/local/share/EMBOSS/doc/programs/html (this will change in release 4.0.0).

You are seeing some of the changes made since 3.0.0 to standardise the
names of command line qualifiers.

Hope that helps,

Peter


From blanchard at microbio.umass.edu  Thu Mar  9 21:18:55 2006
From: blanchard at microbio.umass.edu (Jeffrey Blanchard)
Date: Thu, 9 Mar 2006 16:18:55 -0500
Subject: [EMBOSS] d_ino
Message-ID: <C2354214-A937-4FFE-BCD4-0562F896482D@microbio.umass.edu>

Hello,

I am trying to install EMBOSS under cygwin for teaching purposes.

make crashes on ajfile because d_ino appears to be missing in the current
version of cygwin.

Is there a workaround for this?

Thanks, Jeff

-------------------------------
Jeffrey L. Blanchard
Assistant Professor
Department of Microbiology
University of Massachusetts
Amherst, MA 01003
Office and Lab: Morrill I N330
Tel: 413-577-2130
Fax: 413-545-1578
http://www.bio.umass.edu/micro/blanchard/Lab_About.html


From ajb at ebi.ac.uk  Fri Mar 10 00:22:45 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 10 Mar 2006 00:22:45 -0000 (GMT)
Subject: [EMBOSS] d_ino
In-Reply-To: <C2354214-A937-4FFE-BCD4-0562F896482D@microbio.umass.edu>
References: <C2354214-A937-4FFE-BCD4-0562F896482D@microbio.umass.edu>
Message-ID: <41243.81.96.70.96.1141950165.squirrel@webmail.ebi.ac.uk>

Hi,

Yes, there is indeed a fix. Look in this directory:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/

The README file there will usually tell you what each of the files
fixes.

HTH

Alan Bleasby
EBI




From jison at ebi.ac.uk  Wed Mar 15 17:09:59 2006
From: jison at ebi.ac.uk (Jon Ison)
Date: Wed, 15 Mar 2006 17:09:59 -0000 (GMT)
Subject: [EMBOSS] EMBOSS Developers Course - reminder
Message-ID: <39760.172.31.70.94.1142442599.squirrel@webmail.ebi.ac.uk>

Hi

There are still some places left on this course.
Get in touch if you'd like to attend.

Cheers

Jon


BSDC 2006
Bioinformatics Software Development Course
April 18-20 2006

Following on from the highly successful BSDC 2003/2004 courses, a new
series of courses on 'Bioinformatics Software Development' using
EMBOSS will be held in the training room at The Wellcome Trust Conference
Centre on April 18-20, 2006.

The course will give a good introduction to programming in EMBOSS.
By the end of the course you will be experienced in all the steps in
writing a basic bioinformatics application using the EMBOSS
programming libraries.

The course would suit competent programmers, probably with at least a
couple of years of experience. A reasonable working knowledge of C is
required to get the most out of the course; familiarity with pointers
is helpful but not essential. That said, all are welcome regardless
of background or experience.

Places are limited so please email Liz Ford (ford at ebi.ac.uk) to register
as soon as possible.

We do not make a profit on the course but must charge £125 per person
(for the 3 days) to recover some of our costs.
We are unable to take credit card payments. The preferred method
of payment is by cheque made payable to 'Industry Workshops'.
If you wish to pay in cash or by bank transfer please contact Liz Ford
(ford at ebi.ac.uk)

To read more about the course see
http://emboss.sourceforge.net/developers/developers_course/
To read more about EMBOSS see
http://emboss.sourceforge.net/

To register:
email Liz Ford (ford at ebi.ac.uk)
with your full name, address, phone number
You will then receive an email back confirming your registration or not.
Please note, as mentioned before, places are limited so not all registrations
will be successful.

For further information
email Jon Ison (jison at ebi.ac.uk)


From pmr at ebi.ac.uk  Mon Mar 27 17:50:09 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Mon, 27 Mar 2006 18:50:09 +0100 (BST)
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <00b501c651be$95b37500$e6028a0a@GOLHARMOBILE1>
References: <00b501c651be$95b37500$e6028a0a@GOLHARMOBILE1>
Message-ID: <2253.86.132.217.176.1143481809.squirrel@webmail.ebi.ac.uk>

Ryan Golhar wrote:

> I have a BLAST alignment: query sequence and database sequence.
>
> The alignment is only showing the HSP from the blast output as expected,
> however I want to build an alignment of the entire database sequence
> against my query sequence.
>
> I tried using needle from EMBOSS, however it's aligning the sequences
> completely differently than BLAST does.  What I'd really like is a way to
> anchor the alignment based on the BLAST HSP.  Does anyone know how to do
> this, or what tool(s) will allow me to do this?

You are quite right that EMBOSS may align the sequences completely
differently - unless the HSPs are very significant and cover most of the
sequence this will be true of any attempt to simply realign. There has to
be some way to pass on the HSPs as fixed positions, as in the BioPerl
solution.

However, it could make a nice EMBOSS application - the only question would
be how you would like to specify the HSPs. Perhaps we could read BLAST
output (in some specified format), or perhaps some other way to give the
input alignments.

We do have at least one EMBOSS application that does something similar
(finds all long perfect matches and interpolates) - we just need to reuse
the interpolation code which is basically doing a global alignment of the
bits in between. That also tackles the problem of choosing which
non-compatible initial matches to use.
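
The idea of keeping an HSP fixed and globally aligning "the bits in between"
can be sketched roughly as follows. This is only an illustration under
stated assumptions, not the code of any existing EMBOSS application: it uses
a simple Needleman-Wunsch with a linear gap cost, takes a single HSP given
as its two gapped strings plus 0-based start offsets, and the function names
global_align and anchored_align are invented here.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch with a linear gap cost; returns two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def anchored_align(query, subject, hsp_query, hsp_subject, q_start, s_start):
    """Keep one HSP fixed and globally align the flanking regions.

    hsp_query / hsp_subject are the gapped HSP strings as reported by the
    search; q_start / s_start are 0-based offsets of the HSP in each sequence.
    """
    q_end = q_start + len(hsp_query.replace("-", ""))
    s_end = s_start + len(hsp_subject.replace("-", ""))
    left_q, left_s = global_align(query[:q_start], subject[:s_start])
    right_q, right_s = global_align(query[q_end:], subject[s_end:])
    return left_q + hsp_query + right_q, left_s + hsp_subject + right_s


# Toy sequences (made up) sharing the segment ACGTACG at offset 2 in both.
query = "AAACGTACGTTT"
subject = "CCACGTACGGG"
aligned_q, aligned_s = anchored_align(query, subject,
                                      "ACGTACG", "ACGTACG", 2, 2)
print(aligned_q)
print(aligned_s)
```

Handling several HSPs would additionally need a step that picks a mutually
compatible, colinear subset of them before aligning the gaps in between.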

Hope that helps,

Peter


From golharam at umdnj.edu  Mon Mar 27 16:50:42 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 27 Mar 2006 11:50:42 -0500
Subject: [EMBOSS] Building an alignment from BLAST hsp
Message-ID: <00b501c651be$95b37500$e6028a0a@GOLHARMOBILE1>

I have a BLAST alignment: query sequence and database sequence.  

The alignment is only showing the HSP from the blast output as expected,
however I want to build an alignment of the entire database sequence
against my query sequence.  

I tried using needle from EMBOSS, however it's aligning the sequences
completely differently than BLAST does.  What I'd really like is a way to
anchor the alignment based on the BLAST HSP.  Does anyone know how to do
this, or what tool(s) will allow me to do this?

Ryan


From golharam at umdnj.edu  Mon Mar 27 18:03:39 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 27 Mar 2006 13:03:39 -0500
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <2253.86.132.217.176.1143481809.squirrel@webmail.ebi.ac.uk>
Message-ID: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1>

Hi Peter,

> You are quite right that EMBOSS may align the sequences completely 
> differently - unless the HSPs are very significant and cover most 
> of the sequence this will be true of any attempt to simply realign. 
> There has to be some way to pass on the HSPs as fixed positions, 
> as in the BioPerl solution.

I looked at a bioperl method, but can't seem to find something that will
accomplish this.  

> However, it could make a nice EMBOSS application - the only question 
> would be how you would like to specify the HSPs. Perhaps we could read
> BLAST output (in some specified format), or perhaps some other way to 
> give the input alignments.

Yes, I agree.  I suppose the best way would be to specify the two
sequences and the blast output.  The application could then construct an
alignment based on a particular HSP (probably the first one, or whatever
the user specifies).

Ryan


From letondal at pasteur.fr  Tue Mar 28 07:25:07 2006
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue, 28 Mar 2006 09:25:07 +0200
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1>
References: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1>
Message-ID: <4b91818a096ba42d8d53279a7f63e6ea@pasteur.fr>


On Mar 27, 2006, at 8:03 PM, Ryan Golhar wrote:

> Hi Peter,
>
>> You are quite right that EMBOSS may align the sequences completely
>> differently - unless the HSPs are very significant and cover most
>> of the sequence this will be true of any attempt to simply realign.
>> There has to be some way to pass on the HSPs as fixed positions,
>> as in the BioPerl solution.
>
> I looked at a bioperl method, but can't seem to find something that
> will accomplish this.
>
>> However, it could make a nice EMBOSS application - the only question
>> would be how you would like to specify the HSPs. Perhaps we could read
>> BLAST output (in some specified format), or perhaps some other way to
>> give the input alignments.
>
> Yes, I agree.  I suppose the best way would be to specify the two
> sequences and the blast output.  The application could then construct
> an alignment based on a particular HSP (probably the first one, or
> whatever the user specifies).
>

Have you tried this:
http://bioweb.pasteur.fr/seqanal/interfaces/seqsblast.html

It is based on bioperl. check "Get HSP" option (you can even extend it).

Best,

--
Catherine Letondal -- Institut Pasteur -- Computing Center


From cquijano at iib.uam.es  Tue Mar 28 09:49:01 2006
From: cquijano at iib.uam.es (Carlos Quijano)
Date: Tue, 28 Mar 2006 11:49:01 +0200
Subject: [EMBOSS] [BiO BB] Building an alignment from BLAST hsp
In-Reply-To: <4b91818a096ba42d8d53279a7f63e6ea@pasteur.fr>
References: <010501c651c8$c6b4bb00$e6028a0a@GOLHARMOBILE1> 
	<4b91818a096ba42d8d53279a7f63e6ea@pasteur.fr>
Message-ID: <1143539342.8611.45.camel@localhost.localdomain>

Hi all,

I didn't read the thread earlier, sorry for the lapse. And apologies, Ryan,
if what follows is not exactly what you need.

What you are looking for is probably _MVIEW_, an old but nice application.
Search scholar.google.com or PubMed for more information about it; I
remember that there are web servers running it as a CGI somewhere. It is
possible that in the last few years somebody has published a better tool
or a new MVIEW version, so it is worth looking.

MVIEW is a parser for your BLAST output.
It fits your problem because you want to align a single sequence (as a
template) against an entire database (I suppose with some cutoff on the
e-value or p-value, at least the default, which is ten), against a set of
sequences, or against just one other sequence (a two-sequence alignment).

Some considerations about aligning BLAST HSPs the way you intend and the
way MVIEW does. They are important and take only a minute to read:
Remember, what you get is what you asked for, but it is not the real thing
(this is very typical in bioinformatics, and in all science, hahaha).
You do not get a real multiple alignment. You get an artifact: the HSPs of
every database hit stacked under a template gene (your sequence).
So you do not have a true alignment by any means, not even an alignment of
the genes based on HSPs, because HSPs that could be aligned between
sequences in the database are hidden when those sequences are stacked under
yours: your sequence lacks those HSPs, so they are _ignored_.
Why is this so important?
If you use these "sequences stacked under a template" as a multiple
alignment, you will misrepresent the underlying topology of the gene
network that arises from your database plus your sequence when they are
correctly aligned, that is, all against all.
All against all is the mathematically exhaustive, optimal way. Normally we
use heuristics again and again, but "all against all" is the key concept
in the multiple alignment problem, and it is very important to be aware of
this.
needle is the optimal way <-> BLAST is the heuristic
Clustal is also a very heuristic solution to the massive problem of
multiple alignment. Personally I prefer muscle, which uses a better
mathematical model and is (right now) the quickest aligner in most
cases.
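
To give a feel for why "all against all" gets expensive, the number of
pairwise comparisons for n sequences is n * (n - 1) / 2. A small sketch,
with a made-up helper name and placeholder sequence labels:

```python
from itertools import combinations


def pairwise_jobs(names):
    """List every unordered pair that an all-against-all comparison visits."""
    return list(combinations(names, 2))


# Placeholder labels; any four sequences give the same count.
pairs = pairwise_jobs(["seq1", "seq2", "seq3", "seq4"])
print(len(pairs), "pairwise alignments for 4 sequences")
```

The count grows quadratically with the number of sequences, which is one
reason progressive and other heuristic methods are used in practice.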

I am sure that most of you know this already.
I hope it is useful for newbies and others, so forgive me for the
long, tedious discourse...


CQ

Carlos Quijano
http://www2.iib.uam.es/cquijano
Evolution and Development laboratory
Regulation of Gene Expression Department
Institute for Biomedical Research
http://www.iib.uam.es


From kvddrift at earthlink.net  Thu Mar 30 00:36:23 2006
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Wed, 29 Mar 2006 19:36:23 -0500
Subject: [EMBOSS] crash on intel-Mac
Message-ID: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>

Hi,

I got a report from a user (of the fink package of emboss) that the  
following crashes occur on his Mac with an intel processor:

% wossname
Error: Failed to compile regular expression '^(.*/)[^/]+/?$' at  
position 716: range out of order in character class
Bus error

All other programs just give a bus error.


I don't get these errors on a Mac with a PowerPC processor.

This is emboss 3.0.0.


- Koen.


From areagp61 at yahoo.it  Thu Mar 30 08:31:42 2006
From: areagp61 at yahoo.it (Graziano P.)
Date: Thu, 30 Mar 2006 10:31:42 +0200 (CEST)
Subject: [EMBOSS] dbifasta index file format
Message-ID: <20060330083142.4237.qmail@web26207.mail.ukl.yahoo.com>

Hello EMBOSS users,
I have some databases in fasta format (NCBI | format)
and I want to index them using dbifasta. I then want
to access the index files using a program that will be
developed by a computer scientist in my group.
I need to index the databases by accession number,
GI number and description. I have read in the dbifasta
help about the structure of the index files when
the databases are indexed by accession number, but I
have not found anything about the structure of the index
files when the databases are indexed by description.
Does anyone know where I can find detailed information
about the structure of the index files?

Regards
Graziano




From ajb at ebi.ac.uk  Thu Mar 30 08:38:10 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 30 Mar 2006 09:38:10 +0100 (BST)
Subject: [EMBOSS] crash on intel-Mac
In-Reply-To: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
References: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
Message-ID: <37407.81.98.244.247.1143707890.squirrel@webmail.ebi.ac.uk>

Hi,

Thanks. We already have a report of this and are working on a
solution.

Alan


From gbottu at ben.vub.ac.be  Thu Mar 30 09:37:23 2006
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Thu, 30 Mar 2006 11:37:23 +0200
Subject: [EMBOSS] A note about fastA format(s)
Message-ID: <20060330093723.GA18690@bigben.ulb.ac.be>

	Dear friends,

We are using EMBOSS version 3.0. One of my colleagues tried to use a 
multiple sequence file in fastA format, where each comment line starts 
with a string containing multiple pipe signs. A USA of type
fasta::file:xx|yy|zz|uu|ss
did not work. After some trial and error I found that putting "pearson"
instead of "fasta" helped. This is strange, since according to the on-line
manual at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
"fasta" and "pearson" are synonyms. Here it seems that "fasta" is instead 
treated the same as "ncbi". Comments?

	Guy Bottu,
	BEN
 

From enrique.deandres at pcm.uam.es  Thu Mar 30 15:46:30 2006
From: enrique.deandres at pcm.uam.es (Enrique de Andres Saiz)
Date: Thu, 30 Mar 2006 17:46:30 +0200
Subject: [EMBOSS] Problem indexing PDB fasta file
Message-ID: <442BFD56.9010908@pcm.uam.es>

Hello,

I'm trying to index the fasta file of the PDB database with the dbifasta 
command and I get a lot of warnings such as:

Warning: Duplicate ID skipped: '1FNT_A' All hits will point to first ID found

Looking at the PDB fasta file, I see that, for the warning above, there
is one entry whose ID is '1FNT_A' and another whose ID is '1FNT_a'. This
makes me think that EMBOSS is case-insensitive. Is this true? Is there
any way to distinguish between the two IDs?

Thanks in advance,

Enrique.
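
To see how many entries are affected before indexing, the IDs that differ
only by case can be listed with a short script. This is a minimal sketch,
assuming the ID is the first word after '>' on each header line; the
function name case_collisions and the sample header text are hypothetical.

```python
from collections import defaultdict


def case_collisions(fasta_lines):
    """Group distinct IDs that become identical when case is ignored."""
    seen = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        seq_id = fields[0]
        key = seq_id.lower()
        if seq_id not in seen[key]:  # ignore exact repeats of the same ID
            seen[key].append(seq_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# In-memory stand-in for a FASTA file; the descriptions are invented.
sample = [
    ">1FNT_A example chain",
    "MSEQ",
    ">1FNT_a example chain",
    "MSEQ",
    ">2ABC_A example chain",
    "AAAA",
]
for key, ids in case_collisions(sample).items():
    print(key, "->", ", ".join(ids))
```

The same function accepts an open file handle in place of the list, since
it only iterates over lines.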


From pmr at ebi.ac.uk  Thu Mar 30 21:47:19 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Thu, 30 Mar 2006 22:47:19 +0100 (BST)
Subject: [EMBOSS] A note about fastA format(s)
In-Reply-To: <20060330093723.GA18690@bigben.ulb.ac.be>
References: <20060330093723.GA18690@bigben.ulb.ac.be>
Message-ID: <50335.68.153.173.207.1143755239.squirrel@webmail.ebi.ac.uk>

Dear Guy,

> We are using EMBOSS version 3.0. One of my colleagues tried to use a
> multiple sequence file in fastA format, where each comment line starts
> with a string containing multiple pipe signs. An USA of type
> fasta::file:xx|yy|zz|uu|ss
> did not work. After some trial I found that putting "pearson" instead of
> "fasta" helped. This is strange, since according to the on-line manual at
> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
> "fasta" and "pearson" are synonyms. Here it seems that "fasta" is instead
> treated the same as "ncbi". Comments ?

Yes, that is indeed true. We had to make changes to support various NCBI
formats, and made FASTA and NCBI the same. We kept "pearson" as the
original plain fasta format.

We will update the documentation - it will take a little time to check for
any other changes to the formats.

regards,

Peter
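
As a rough illustration of why the two readings of a header line can give
different IDs (this is not EMBOSS's actual parser, and both function names
are invented): a plain reader takes the whole first word as the ID, while
an NCBI-style reader breaks that word apart at the pipe signs.

```python
def pearson_id(header):
    """Plain FASTA reading: the ID is the first word after '>'."""
    parts = header.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


def ncbi_style_fields(header):
    """NCBI-style reading (simplified): split the first word on '|'."""
    return pearson_id(header).split("|")


# Header modelled on the xx|yy|zz|uu|ss example from this thread.
header = ">xx|yy|zz|uu|ss some description"
print(pearson_id(header))
print(ncbi_style_fields(header))
```

With the plain reading the full piped string is available as an ID, which
is consistent with the "pearson" format working where "fasta" did not.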


From ajb at ebi.ac.uk  Fri Mar 31 12:12:53 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 31 Mar 2006 13:12:53 +0100 (BST)
Subject: [EMBOSS] crash on intel-Mac
In-Reply-To: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
References: <E24BA334-87A3-4EE1-91D7-C63B1A02BA63@earthlink.net>
Message-ID: <51078.81.98.244.247.1143807173.squirrel@webmail.ebi.ac.uk>

This should now be fixed as long as you apply all the fixes to EMBOSS-3.0.0
from the directory:

    ftp://emboss.open-bio.org/pub/EMBOSS/fixes/

The latest file there is a new 'configure'; however, if you have not
also applied the previous patches in the above directory, you'll get a
compilation failure. Look at the README for details of what the
patches fix.

Thanks to Bill van Etten for previous emails on this.

Changes to the CVS developers version will follow.

Alan