From wo.granon at gmail.com  Wed Feb  2 14:02:46 2011
From: wo.granon at gmail.com (Wolfgang Gruber)
Date: Wed, 2 Feb 2011 20:02:46 +0100
Subject: [EMBOSS] Mistake in Appdoc Edialign?
Message-ID: <AANLkTims=vgX=pRjQ4_Ngnypw7s4-+UyP0cLvUHLKEvQ@mail.gmail.com>

Hello,

I have studied the papers for DIALIGN, and only for the newest version,
DIALIGN-TX (Subramanian et al., 2008), can I find the information that
DIALIGN uses a guide tree. In the appdoc for edialign I read that
EMBOSS uses DIALIGN2. In that publication (Morgenstern, 1999) I cannot
find any statement that a guide tree is used. In the original
DIALIGN2 documentation I also read: "This tree is constructed by applying
the UPGMA clustering method to the DIALIGN similarity scores." but
nothing says that this tree is used for guiding.

So is this information in the EMBOSS appdoc incorrect?

More generally, is there a plan to update to DIALIGN-TX?

Thanks,
Wolfgang


Morgenstern, B.: DIALIGN 2: improvement of the segment-to-segment
approach to multiple sequence alignment. In: Bioinformatics (Oxford,
England) vol. 15 (1999), no. 3, pp. 211-218. PMID: 10222408
Subramanian, A.; Kaufmann, M.; Morgenstern, B.: DIALIGN-TX: greedy
and progressive approaches for segment-based multiple sequence
alignment. In: Algorithms for Molecular Biology vol. 3 (2008), no. 1,
p. 6


From oliver.liegmann at biologie.uni-freiburg.de  Fri Feb  4 02:53:38 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Fri, 04 Feb 2011 08:53:38 +0100
Subject: [EMBOSS] seqret does not find sequence after update
Message-ID: <1296806018.12454.29.camel@yoda>

Dear list members,

have any of you also run into this problem (and perhaps have an idea of what is going wrong)?

After upgrading from version 6.2.0 to 6.3.1, seqret no longer works
properly.
First, EMBOSS was installed using
./configure --enable-64 --prefix=/opt/emboss 
make
make install

The database was set up with:
dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas

Using seqret to retrieve the sequences produces an error:
seqret plafa_test:PLAFA_MAL13P1.23-b
Reads and writes (returns) sequences
output sequence(s) [plafa_mal13p1.fasta]:
Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a'

Only the remaining two sequences are stored in the output file.

Have the characters allowed in accessions changed? With EMBOSS
6.2.0 we did not have any problems, but after the upgrade a large number of
sequences could no longer be retrieved from our internal FASTA database,
although the output in outfile.dbifasta shows that all sequences were
inserted into the database.


The contents of the files are:
PLAFA_test.fas:
>PLAFA_MAL13P1.23-b
MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
>PLAFA_MAL13P1.237a
MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
>PLAFA_MAL13P1.23-a
MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH


test.txt:
plafa:PLAFA_MAL13P1.23-b
plafa:PLAFA_MAL13P1.237a
plafa:PLAFA_MAL13P1.23-a


emboss.default:
DB plafa [
	format: fasta
	method: emblcd
	directory: /home/liegmann/genomezoo/emboss/prob/test/db
	type: P
]


Best regards,
Oliver Liegmann

-- 
Dipl.-Inf. Oliver Liegmann

AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg

+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html


-------------- next part --------------
A non-text attachment was scrubbed...
Name: outfile.dbifasta
Type: application/octet-stream
Size: 848 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20110204/3abfcd4c/attachment.obj>

From Caroline.Barretto at rdls.nestle.com  Tue Feb  8 05:01:28 2011
From: Caroline.Barretto at rdls.nestle.com (Barretto, Caroline, LAUSANNE,
	BioInformatics)
Date: Tue, 8 Feb 2011 11:01:28 +0100
Subject: [EMBOSS] diffseq memory problem?
Message-ID: <BDDDF6E121C51A4080F4B7E5605E2FBB347C88@HQVEVE0014.nestle.com>

Dear EMBOSS developers,

 
I have been using diffseq to compare two strains of the same bacterial
species using "10" as wordsize without any problem. 

However, when I try to reduce this number to "4", after several hours of
calculation the server collapses, all RAM and SWAP are used.

Is there any option to avoid that, or do you know if someone is working
on that problem?

 
Many thanks,

 
Best regards,

 
Caroline.

 
From pmr at ebi.ac.uk  Tue Feb  8 05:46:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 08 Feb 2011 10:46:32 +0000
Subject: [EMBOSS] diffseq memory problem?
In-Reply-To: <BDDDF6E121C51A4080F4B7E5605E2FBB347C88@HQVEVE0014.nestle.com>
References: <BDDDF6E121C51A4080F4B7E5605E2FBB347C88@HQVEVE0014.nestle.com>
Message-ID: <4D511F08.7010206@ebi.ac.uk>

Dear Caroline,

On 08/02/2011 10:01, Barretto, Caroline, LAUSANNE, BioInformatics wrote:
> Dear EMBOSS developers,
>
> I have been using diffseq to compare two strains of the same bacterial
> species using "10" as wordsize without any problem.
>
> However, when I try to reduce this number to "4", after several hours of
> calculation the server collapses, all RAM and SWAP are used.
>
> Is there any option to avoid that, or do you know if someone is working
> on that problem?

Depending on the input size, and the number of simple repeats, a low 
word size could easily generate too many matches for large sequence lengths.

We would recommend reducing the word size more slowly (maybe 10, 8, 6).

As a guideline, finding more matches than there are non-overlapping 
words in the sequence is unlikely to be useful and is a reasonable point 
to stop reducing the word size.
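
The guideline above can be sketched in Python. This is only an illustration of why small word sizes produce so many matches, not diffseq's actual algorithm; the function names are invented for the example, and taking the shorter sequence as the reference for the "non-overlapping words" limit is an assumption.

```python
from collections import Counter


def count_word_matches(seq_a, seq_b, wordsize):
    """Count pairs (i, j) where the word at seq_a[i] equals the word at seq_b[j]."""
    words_a = Counter(
        seq_a[i:i + wordsize] for i in range(len(seq_a) - wordsize + 1)
    )
    return sum(
        words_a[seq_b[j:j + wordsize]] for j in range(len(seq_b) - wordsize + 1)
    )


def non_overlapping_words(seq, wordsize):
    """Number of non-overlapping words of this size that fit in the sequence."""
    return len(seq) // wordsize


def too_many_matches(seq_a, seq_b, wordsize):
    """Apply the rule of thumb, using the shorter sequence (an assumption)."""
    limit = non_overlapping_words(min(seq_a, seq_b, key=len), wordsize)
    return count_word_matches(seq_a, seq_b, wordsize) > limit


if __name__ == "__main__":
    a = "ACGTACGT"
    b = "ACGT"
    for w in (4, 2):
        print(w, count_word_matches(a, b, w), too_many_matches(a, b, w))
```

On real genomes one would step the word size down (10, 8, 6) and stop as soon as the check reports too many matches.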

Meanwhile, we will take a look at diffseq in case there is some way to 
improve its performance or to warn at an early stage if the word size 
appears small for the input sequence lengths and may generate too many 
matches.

Hope this helps

Peter Rice
EMBOSS Team

From WulfDirk.Leuschner at sanofi-aventis.com  Thu Feb 10 02:43:17 2011
From: WulfDirk.Leuschner at sanofi-aventis.com (WulfDirk.Leuschner at sanofi-aventis.com)
Date: Thu, 10 Feb 2011 08:43:17 +0100
Subject: [EMBOSS] lit. references for EMBOSS data files,
	e.g. Epk.dat (iep usage)
Message-ID: <650F10565E484347B51CF6679663A80402F51F19@ffpw10.f2.enterprise>

Hi all, 
I was wondering whether someone might know something about how some of
the meta data used in EMBOSS were compiled. A colleague of mine was
looking for a reference for the Epk.dat values used for the
determination of the isoelectric point of a protein. However, neither
she nor I could find anything... Any hints?
Wulf Dirk Leuschner 


From jison at ebi.ac.uk  Thu Feb 10 11:42:34 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 10 Feb 2011 16:42:34 -0000 (UTC)
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D41F992.5030900@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu>
Message-ID: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>

Hi Lionel

Didn't see a reply to you, sorry.

Anyhow, dreg will search the sequence as given.  This is taken as the sense/coding/+ strand.

If you specify -sreverse (which is available to any application that reads sequences), I
think it will search the reverse complement of that sequence instead.

Cheers

Jon


> Hello fellow EMBOSS fans,
>
> I am using the dreg program to search the human genome for my favorite
> motif.  I was unable to find any information regarding the meaning of
> the strand information in the output.  Does dreg search both strands or
> will it always return "+" as the strand designation of the hits that it
> finds?
>
> Thanks for your continued support and development of this fantastic tool!
>
> Sincerely,
> Lionel "Lee" Brooks 3rd
> Dartmouth Genetics Grad Student
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>


From sigve.nakken at medisin.uio.no  Fri Feb 11 05:54:44 2011
From: sigve.nakken at medisin.uio.no (Sigve Nakken)
Date: Fri, 11 Feb 2011 11:54:44 +0100
Subject: [EMBOSS] DNA sequence as input argument
Message-ID: <4D551574.2080004@medisin.uio.no>

Hi,

Is there any way in which one can read a DNA sequence directly from the
command line (that is, as a string input argument) rather than from a
file? I am especially interested in finding repeats, inverted repeats
etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead of creating a
FASTA file for each query sequence, I would like to read the sequence
directly from the command line. Is this possible?

Kind regards,
Sigve

From pmr at ebi.ac.uk  Fri Feb 11 06:11:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 11 Feb 2011 11:11:13 +0000
Subject: [EMBOSS] DNA sequence as input argument
In-Reply-To: <4D551574.2080004@medisin.uio.no>
References: <4D551574.2080004@medisin.uio.no>
Message-ID: <4D551951.7030406@ebi.ac.uk>

Dear Sigve,

On 11/02/2011 10:54, Sigve Nakken wrote:
> Hi,
>
> Is there any way in which one can read a DNA sequence directly from the
> command line (that is, as a string input argument) rather than from a
> file? I am especially interested in finding repeats, inverted repeats
> etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead of creating a
> FASTA file for each query sequence, I would like to read the sequence
> directly from the command line. Is this possible?

seqret asis::ctgatcgatgctagctgac

the "asis" format was included exactly for this purpose. You do need to 
take care that a long sequence is not too long for your shell to handle 
on the command line (a shell issue, not an EMBOSS issue).

You can also add to the command line:

-sid abc123

This will give it an ID of abc123 and the output file will default to 
(for seqret) abc123.fasta and will have the abc123 identifier in it.
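
For scripted use, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the pieces described above (the "asis::" prefix and the optional -sid qualifier); the helper names are invented for illustration, and the letters-only check is a simplifying assumption rather than an EMBOSS rule.

```python
import shlex


def asis_usa(sequence):
    """Return an 'asis::' address for a raw sequence string."""
    cleaned = "".join(sequence.split())  # drop spaces and newlines
    if not cleaned.isalpha():
        # Simplifying assumption for this sketch: letters only.
        raise ValueError("sequence should contain letters only")
    return "asis::" + cleaned


def build_command(program, sequence, sid=None):
    """Build an argument list such as: seqret asis::ctga... -sid abc123"""
    argv = [program, asis_usa(sequence)]
    if sid is not None:
        argv += ["-sid", sid]
    return argv


if __name__ == "__main__":
    argv = build_command("seqret", "ctgatcgatgctagctgac", sid="abc123")
    print(shlex.join(argv))
```

If EMBOSS is installed, the list can be handed to subprocess.run(argv), which bypasses the shell's quoting rules; the operating system's limit on total argument length still applies.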

Hope this helps

Peter Rice
EMBOSS Team

From stephen.taylor at imm.ox.ac.uk  Fri Feb 11 06:13:45 2011
From: stephen.taylor at imm.ox.ac.uk (Steve Taylor)
Date: Fri, 11 Feb 2011 11:13:45 +0000
Subject: [EMBOSS] DNA sequence as input argument
In-Reply-To: <4D551574.2080004@medisin.uio.no>
References: <4D551574.2080004@medisin.uio.no>
Message-ID: <4D5519E9.1050401@imm.ox.ac.uk>

Hi,

>
> Is there any way in which one can read a DNA sequence directly from the
> command line (that is, as a string input argument) rather than from a
> file? I am especially interested in finding repeats, inverted repeats
> etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead of creating a
> FASTA file for each query sequence, I would like to read the sequence
> directly from the command line. Is this possible?
>

From the EMBOSS FAQ (http://emboss.sourceforge.net/docs/faq.html):


A) The "filename" is really the sequence. This is a quick and easy way of reading in a short fragment of sequence without having to enter it into a file.

For example:

    % program -seq asis::ATGGTGAGGAGAGTTGTGATGAGA


Steve

From Lionel.Brooks at dartmouth.edu  Fri Feb 11 13:22:38 2011
From: Lionel.Brooks at dartmouth.edu (Lionel (Lee) Brooks 3rd)
Date: Fri, 11 Feb 2011 13:22:38 -0500
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
References: <4D41F992.5030900@dartmouth.edu>
	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
Message-ID: <4D557E6E.3020608@dartmouth.edu>

Hi Jon,

Thank you!  Apparently, I should have rtfm more than once.

Sincerely,
Lionel

Jon Ison wrote:
> Hi Lionel
>
> Didn't see a reply to you, sorry.
>
> Anyhow, dreg will search the sequence as given.  This is taken as the sense/coding/+ strand.
>
> If you specify -sreverse (which is available to any applications that read sequences) it will I
> think search the reverse complement of that sequence instead.
>
> Cheers
>
> Jon
>
>
>
>   
>> Hello fellow EMBOSS fans,
>>
>> I am using the dreg program to search the human genome for my favorite
>> motif.  I was unable to find any information regarding the meaning of
>> the strand information in the output.  Does dreg search both strands or
>> will it always return "+" as the strand designation of the hits that it
>> finds?
>>
>> Thanks for your continued support and development of this fantastic tool!
>>
>> Sincerely,
>> Lionel "Lee" Brooks 3rd
>> Dartmouth Genetics Grad Student

From pmr at ebi.ac.uk  Fri Feb 11 13:46:22 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 11 Feb 2011 18:46:22 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D557E6E.3020608@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu>	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
	<4D557E6E.3020608@dartmouth.edu>
Message-ID: <4D5583FE.1060703@ebi.ac.uk>

Dear Lee,

On 11/02/2011 18:22, Lionel (Lee) Brooks 3rd wrote:
> Hi Jon,
>
> Thank you! Apparently, I should have rtfm more than once.

True ... but it is not obvious which part to (re-)read. We could make it 
easier.

Perhaps the "Input sequence" section could point to the sequence 
qualifiers and the USA syntax. We will look at improving this part of the 
documentation in the next release.

regards,

Peter Rice
EMBOSS Team


From db60 at st-andrews.ac.uk  Sat Feb 12 07:07:03 2011
From: db60 at st-andrews.ac.uk (Daniel Barker)
Date: Sat, 12 Feb 2011 12:07:03 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D5583FE.1060703@ebi.ac.uk>
References: <4D41F992.5030900@dartmouth.edu>	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>	<4D557E6E.3020608@dartmouth.edu>
	<4D5583FE.1060703@ebi.ac.uk>
Message-ID: <4D5677E7.1080402@st-andrews.ac.uk>

Dear Peter,

A lot of the time for nucleotide stuff it makes sense to search both 
strands. Of course, it isn't hard to search one strand, then the other. 
But this introduces an extra step. I wonder if there could be some 
convenient option to do this, and if it should perhaps be the default? 
(As with NCBI blastall with any kind of nucleotide search.)

This would affect programs beyond just dreg and, though it would be OK 
for our work, perhaps it wouldn't make sense for others. Just a thought.

Best regards,

Daniel

-- 
Daniel Barker
http://bio.st-andrews.ac.uk/staff/db60.htm
The University of St Andrews is a charity registered in Scotland : No 
SC013532

From pmr at ebi.ac.uk  Sat Feb 12 11:58:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 12 Feb 2011 16:58:32 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D5677E7.1080402@st-andrews.ac.uk>
References: <4D41F992.5030900@dartmouth.edu>	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>	<4D557E6E.3020608@dartmouth.edu>	<4D5583FE.1060703@ebi.ac.uk>
	<4D5677E7.1080402@st-andrews.ac.uk>
Message-ID: <4D56BC38.9070408@ebi.ac.uk>

Dear Daniel,

On 12/02/2011 12:07, Daniel Barker wrote:
> A lot of the time for nucleotide stuff it makes sense to search both
> strands. Of course, it isn't hard to search one strand, then the other.
> But this introduces an extra step. I wonder if there could be some
> convenient option to do this, and if it should perhaps be the default?
> (As with NCBI blastall with any kind of nucleotide search.)
>
> This would affect programs beyond just dreg and, though it would be OK
> for our work, perhaps it wouldn't make sense for others. Just a thought.

Interesting suggestion.

Maybe we can add a -bothstrands option for applications to search the 
forward and reverse strands. We need to consider:

* Do the results make sense?
* What default do we set (maybe some programs have a different default)?
* Is this complicated for programs that can use DNA or protein input?
* Can we apply it to applications aligning two sequences?

Meanwhile, running twice, with -sreverse the second time, will find you 
all the matches.
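
The two-pass approach can be mimicked outside EMBOSS. The Python sketch below is not dreg itself: it searches a regular expression on the given strand, then on the reverse complement, and maps the second set of hits back to forward-strand coordinates. The function names and the 1-based inclusive coordinates are choices made for this example.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def search_both_strands(pattern, seq):
    """Return (strand, start, end, match) tuples in forward coordinates."""
    hits = []
    # First pass: the sequence as given ("+" strand).
    for m in re.finditer(pattern, seq):
        hits.append(("+", m.start() + 1, m.end(), m.group()))
    # Second pass: the reverse complement ("-" strand).
    rc = reverse_complement(seq)
    n = len(seq)
    for m in re.finditer(pattern, rc):
        # Convert positions on the reversed sequence back to the forward strand.
        hits.append(("-", n - m.end() + 1, n - m.start(), m.group()))
    return hits


if __name__ == "__main__":
    for hit in search_both_strands("ATGC", "ATGCAAAGCAT"):
        print(hit)
```

Note that re.finditer reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead pattern.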

regards,

Peter Rice
EMBOSS Team

From david.bauer at bayer.com  Mon Feb 14 02:36:46 2011
From: david.bauer at bayer.com (david.bauer at bayer.com)
Date: Mon, 14 Feb 2011 08:36:46 +0100
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D56BC38.9070408@ebi.ac.uk>
Message-ID: <OFB8A628DC.F6C6BC09-ONC1257837.00278F8D-C1257837.0029D1D6@bayer.de>

Hi Daniel & Peter,

emboss-bounces at lists.open-bio.org wrote on 12/02/2011 17:58:32:

> Dear Daniel,
> 
> On 12/02/2011 12:07, Daniel Barker wrote:
> > A lot of the time for nucleotide stuff it makes sense to search both
> > strands. Of course, it isn't hard to search one strand, then the other.
> > But this introduces an extra step. I wonder if there could be some
> > convenient option to do this, and if it should perhaps be the default?
> > (As with NCBI blastall with any kind of nucleotide search.)
> >
> > This would affect programs beyond just dreg and, though it would be OK
> > for our work, perhaps it wouldn't make sense for others. Just a thought.
> 

I think another candidate would be fuzznuc. 
(At least, that's the program where I have sometimes missed this option ;-)

> Interesting suggestion.
> 
> Maybe we can add a -bothstrands option for applications to search the 
> forward and reverse strands.

Yes, this would add the new functionality without breaking the old default 
behaviour of the programs.

> We need to consider:
> * Do the results make sense?
> * What default do we set (maybe some programs have a different default)?

As mentioned above, I would not touch the old default settings and add 
searching both strands as an option.
(e.g. in stssearch the search of both strands is already the default.)

> * Is this complicated for programs that can use DNA or protein input?
> * Can we apply it to applications aligning two sequences?

I think it could make sense for programs which can align one sequence 
against a set of other sequences (e.g. water, needle). 

Regards,
David.

From marvin.stodolsky at gmail.com  Mon Feb 14 18:35:45 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Mon, 14 Feb 2011 18:35:45 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
Message-ID: <AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>

This is elementary I'm sure, but I've been unable to work out the
syntax from the documentation.
More minor issue.

When using infoseq to extract all the FASTA headers from a sequence
repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
come out as a uniform field/fields in a resultant spreadsheet. Is there a
fix for this?

MarvS


From pmr at ebi.ac.uk  Tue Feb 15 03:59:20 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 15 Feb 2011 08:59:20 +0000
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
	<AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
Message-ID: <4D5A4068.4000302@ebi.ac.uk>

On 14/02/2011 23:35, Marvin Stodolsky wrote:
>   This is elementary I'm sure, but I've been unable to work out the
> syntax  from the documentation.
> More minor issue.
>
> When using infoseq to extract all the fasta Headers from a sequence
> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
> for this?

I don't see the genebegin and geneend in EMBOSS infoseq output. Are they 
part of the sequence ID in the FASTA file?

You can use a delimiter between items for infoseq using:

  -nocolumn

on the command line.

For import into a spreadsheet you can set the delimiter to be tab with:

  -nocolumn -delimiter "\t"

on the command line. That should then import nicely into a spreadsheet.
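
If the begin..end coordinates are embedded in a text field of the FASTA header, a small post-processing step can move them into columns of their own. The Python sketch below assumes tab-delimited lines such as those written with -nocolumn -delimiter "\t"; the function name and the sample line are invented for illustration and are not real infoseq output.

```python
import re

# Matches coordinate pairs written as begin..end, e.g. 234466..234589
RANGE = re.compile(r"(\d+)\.\.(\d+)")


def split_gene_range(line):
    """Append begin and end columns taken from the first 'N..M' found."""
    fields = line.rstrip("\n").split("\t")
    begin = end = ""
    for field in fields:
        match = RANGE.search(field)
        if match:
            begin, end = match.group(1), match.group(2)
            break
    # Empty strings keep the column count uniform when no range is present.
    return fields + [begin, end]


if __name__ == "__main__":
    sample = "637008924\thypothetical protein 234466..234589\n"  # made-up line
    print("\t".join(split_gene_range(sample)))
```

Every output row then has the same number of columns, which is what a spreadsheet import needs.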

Hope that helps

Peter Rice
EMBOSS Team

From mathog at caltech.edu  Wed Feb 16 15:54:05 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 12:54:05 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID: <E1PpoNx-0006kR-B8@mendel.bio.caltech.edu>

Test case fasta file
>8Achars
AAAAAAAA

all 6 frames for transeq, standard mode emits:
>_1
KKX
>_2
KKX
>_3
KK
>_4
FF
>_5
FFX
>_6
FFX

But...

AAAAAAAA               Forward
TTTTTTTT                          Reverse
                         abc      cba  <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX    4 XFF
x^a^^b^^c^   phase 2   2 KKX    5 XFF
xx^a^^b^     phase 3   3 KK     6  FF

That is, frames 4->6 are supposed to use, respectively, the same
set of codons as 1->3, but translated on the opposite strand. Shouldn't
the number of residues returned then be as in the table above, with
the X at the beginning rather than the end? Or to put it another way,
shouldn't the little "x" bases be ignored on the - strand if they were
also ignored on the +?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From mathog at caltech.edu  Wed Feb 16 17:47:15 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 14:47:15 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID: <E1Ppq9T-0006pO-7W@mendel.bio.caltech.edu>

Here is another worked example with a small but real mRNA fragment.
(Best cut and paste it into a program with a fixed width font).

Test sequence:

>for  (AKA gi|1728|emb|V00893.1, this is "+" direction)
TCGAAAACCGGGCCATGAAGGATGAGGAGAAGATGGAGCTGCA
GGAGATGCAGCTGAAGGAGGCCAAGCACATTGCCGAGGACTCA
GACCGCAAATACGAGGAGGTGGCCAGGAAGCTGGTGATCCTCGA

>rev (reverse complement of for)
TCGAGGATCACCAGCTTCCTGGCCACCTCCTCGTATTTGCGGT
CTGAGTCCTCGGCAATGTGCTTGGCCTCCTTCAGCTGCATCTC
CTGCAGCTCCATCTTCTCCTCATCCTTCATGGCCCGGTTTTCGA

Transeq output, all 6 frames, for >for and >rev
>for_1
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>for_2
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>for_3
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>for_4
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>for_5
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>for_6
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_1
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>rev_2
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>rev_3
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_4
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>rev_5
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>rev_6
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX

Output from a different program, all 12 frame options
shown on the fasta header line as: 

  phase(strand)

Positive phases are measured from sequence position 1. 
Negative phases measured from sequence position
N, the last base in the sequence. 
This program differs from transeq in that any
partial codon is emitted as an X.  Note how
transeq output never starts with an X, whereas
here the X maintains its position on the
Nucleic acid sequence, for instance, +1(+) and +1(-).

>gi|1728|emb|V00893.1|[+1(+)] 
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>gi|1728|emb|V00893.1|[+2(+)] 
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[+3(+)] 
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>gi|1728|emb|V00893.1|[+1(-)] 
XRGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[+2(-)] 
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFS
>gi|1728|emb|V00893.1|[+3(-)] 
XEDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVF
>gi|1728|emb|V00893.1|[-1(-)] 
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>gi|1728|emb|V00893.1|[-2(-)] 
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[-3(-)] 
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>gi|1728|emb|V00893.1|[-1(+)] 
XRKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[-2(+)] 
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SS
>gi|1728|emb|V00893.1|[-3(+)] 
XENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVIL
>gi|1728|emb|V00893.1| 

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From marvin.stodolsky at gmail.com  Wed Feb 16 21:07:51 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Wed, 16 Feb 2011 21:07:51 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <4D5A4068.4000302@ebi.ac.uk>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
	<AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
	<4D5A4068.4000302@ebi.ac.uk>
Message-ID: <AANLkTimVesuYCH4RnUaOZZvFwpKek8Gy_y6hWxb8Gt7w@mail.gmail.com>

All thanks for the suggestions.  A solution to the GeneBegin..GeneEnd
problem has been worked out, per the Attachment, for those interested.

But for me the more important problem is making a FASTA repository,
which is a subset of the gene files in a much larger Repository.  This
is desirable before & after using Usearch -
http://www.drive5.com/usearch/intro.html
to select out a minimally homologous gene set of a species.
Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
the undesirables.

Specifically, is there a command, using entret or its relatives, that accepts a list like
637008924
637008927
640691430
640691431
637008928
637008954
637008980
for extraction and repacking into a single smaller Repository?

If not, could you recommend a software tool/suite for this type of job?

MarvS

On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>
>> This is elementary I'm sure, but I've been unable to work out the
>> syntax from the documentation.
>> More minor issue.
>>
>> When using infoseq to extract all the fasta Headers from a sequence
>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
>> for this?
>
> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
> part of the sequence ID in the FASTA file?
>
> You can use a delimiter between items for infoseq using:
>
>  -nocolumn
>
> on the command line.
>
> For import into a spreadsheet you can set the delimiter to be tab with:
>
>  -nocolumn -delimiter "\t"
>
> on the command line. That should then import nicely into a spreadsheet.
>
> Hope that helps
>
> Peter Rice
> EMBOSS Team
>


From biopython at maubp.freeserve.co.uk  Thu Feb 17 06:05:14 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 11:05:14 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To: <E1PpoNx-0006kR-B8@mendel.bio.caltech.edu>
References: <E1PpoNx-0006kR-B8@mendel.bio.caltech.edu>
Message-ID: <AANLkTimoY8HgrXSbDxTGwoiTx260c+ZsojKJXrgiWG-4@mail.gmail.com>

On Wed, Feb 16, 2011 at 8:54 PM, David Mathog <mathog at caltech.edu> wrote:
> Test case fasta file
>>8Achars
> AAAAAAAA
>
> all 6 frames for transeq, standard mode emits:
>>_1
> KKX
>>_2
> KKX
>>_3
> KK
>>_4
> FF
>>_5
> FFX
>>_6
> FFX
>

Note you can do that with a single command line:

$ transeq asis:AAAAAAAA -filter -frame 6
>asis_1
KKX
>asis_2
KKX
>asis_3
KK
>asis_4
FF
>asis_5
FFX
>asis_6
FFX

Note that while using 1, 2, 3 for the forward frames is well defined, there
are two conventions for the reverse frame - do you start from the left or
the right?

First let's just do the forward frames,

$ transeq asis:AAAAAAAA -filter -frame 1
>asis_1
KKX
$ transeq asis:AAAAAAAA -filter -frame 2
>asis_2
KKX
$ transeq asis:AAAAAAAA -filter -frame 3
>asis_3
KK

Are you happy with them?

Now let's do that with the reverse complement strand:

$ transeq asis:TTTTTTTT -filter -frame 1
>asis_1
FFX
$ transeq asis:TTTTTTTT -filter -frame 2
>asis_2
FFX
$ transeq asis:TTTTTTTT -filter -frame 3
>asis_3
FF

Now let's do that with the original sequence but the negative frames:

$ transeq asis:AAAAAAAA -filter -frame -3
>asis_6
FFX
$ transeq asis:AAAAAAAA -filter -frame -2
>asis_5
FFX
$ transeq asis:AAAAAAAA -filter -frame -1
>asis_4
FF

Same results - perhaps the naming isn't as you expected?
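
The frame numbering shown above can be reproduced with a short Python sketch. It is not transeq's source code: it is written only to match the outputs quoted in this thread, using the standard genetic code, and the function names are invented. frame() follows the behaviour seen here (negative frames reuse the codon boundaries of the matching forward frame), while frame_from_right() is the other convention, which simply counts frames from the start of the reverse complement.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; a trailing partial codon becomes X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3)
    )


def frame(seq, n):
    """Frames 1..3 and -1..-3, matching the outputs quoted in this thread."""
    seq = seq.upper()
    if n > 0:
        return translate(seq[n - 1:])
    k = -n
    # Bases left over at the 3' end of forward frame k are skipped first,
    # so the reverse frame uses the same codon boundaries as frame k.
    leftover = (len(seq) - (k - 1)) % 3
    return translate(reverse_complement(seq)[leftover:])


def frame_from_right(seq, k):
    """Alternative convention: frame k of the reverse complement."""
    return translate(reverse_complement(seq.upper())[k - 1:])


if __name__ == "__main__":
    for n in (1, 2, 3, -1, -2, -3):
        print(n, frame("AAAAAAAA", n))
```

For AAAAAAAA the two conventions yield the same three peptides but attach them to different frame numbers, which is the naming difference discussed above.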

Peter

From oliver.liegmann at biologie.uni-freiburg.de  Thu Feb 17 08:16:46 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Thu, 17 Feb 2011 14:16:46 +0100
Subject: [EMBOSS] seqret does not find sequence after update
In-Reply-To: <40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
References: <1296806018.12454.29.camel@yoda>
	<40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
Message-ID: <1297948606.14091.8.camel@yoda>

Hello,

thank you very much for your reply. Both dbifasta with the "C" locale and
dbxfasta seem to work well.

To answer your question: The operating system is Ubuntu 10.04 (Lucid)
with locales set to de_DE.utf8.

Best regards,
Oliver Liegmann

P.S.: I CC'ed this message to the list so that other users know about
the workaround. Perhaps a short note about the locale issue should be
added to the dbifasta documentation.
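
Why the sort order matters can be illustrated with a small Python sketch. It is not EMBOSS code: it assumes, for illustration only, that an index sorted in one order is later searched with a binary search that compares IDs in plain code-point order, and it imitates a UTF-8 locale collation with a crude key that ignores punctuation. The exact set of IDs that go missing therefore need not match what seqret reported.

```python
import bisect

IDS = ["PLAFA_MAL13P1.23-b", "PLAFA_MAL13P1.237a", "PLAFA_MAL13P1.23-a"]


def locale_like_key(entry_id):
    # Crude stand-in for a UTF-8 locale collation (an assumption, not the
    # real de_DE.utf8 rules): punctuation is ignored and case is folded.
    return "".join(ch for ch in entry_id.lower() if ch.isalnum())


def lookup(index, entry_id):
    """Binary search that compares IDs in plain code-point order."""
    pos = bisect.bisect_left(index, entry_id)
    return pos < len(index) and index[pos] == entry_id


index_c = sorted(IDS)                            # like sorting with LC_ALL=C
index_locale = sorted(IDS, key=locale_like_key)  # locale-like ordering

if __name__ == "__main__":
    for name, index in (("C order", index_c), ("locale-like order", index_locale)):
        print(name, index)
        for entry_id in IDS:
            status = "found" if lookup(index, entry_id) else "MISSING"
            print("  ", entry_id, status)
```

With the index in C order every ID is found; with the locale-like order the binary search can walk past entries that are present, which is consistent with the workaround of re-indexing under LC_ALL=C.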

On Friday, 04.02.2011, at 19:34 +0000, ajb at ebi.ac.uk wrote:
> Hello,
> 
> I could reproduce your problem. It appears to be a manifestation of the
> GNU sort "sorting order". If you, depending on your shell, do:
> 
>   export LC_ALL=C
> 
> or
> 
>   setenv LC_ALL C
> 
> and then re-index using dbifasta then retrieval should work as expected.
> Alternatively use the dbx indexing system, which does not rely on
> GNU sort.
> 
> Incidentally, what operating system and version are you using?
> 
> HTH
> 
> Alan Bleasby
> EBI
> 
> 
> 
> > Dear list members,
> >
> > have any of you also run into this problem (and perhaps have an idea of what is
> > going wrong)?
> >
> > After upgrading from version 6.2.0 to 6.3.1 seqret does not work
> > properly anymore:
> > First, Emboss was installed using
> > ./configure --enable-64 --prefix=/opt/emboss
> > make
> > make install
> >
> > The database was set up with:
> > dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas
> >
> > Using seqret to retrieve the sequences produces an error:
> > seqret plafa_test:PLAFA_MAL13P1.23-b
> > Reads and writes (returns) sequences
> > output sequence(s) [plafa_mal13p1.fasta]:
> > Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a'
> >
> > Only the remaining two sequences are stored in the output file.
> >
> > Have the characters allowed in accessions changed? With EMBOSS
> > 6.2.0 we did not have any problems, but after the upgrade a large number of
> > sequences could no longer be retrieved from our internal FASTA database,
> > although the output in outfile.dbifasta shows that all sequences were
> > inserted into the database.
> >
> >
> > The content of the different files are:
> > PLAFA_test.fas:
> >>PLAFA_MAL13P1.23-b
> > MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
> >>PLAFA_MAL13P1.237a
> > MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
> >>PLAFA_MAL13P1.23-a
> > MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
> >
> >
> > test.txt:
> > plafa:PLAFA_MAL13P1.23-b
> > plafa:PLAFA_MAL13P1.237a
> > plafa:PLAFA_MAL13P1.23-a
> >
> >
> > emboss.default:
> > DB plafa [
> > 	format: fasta
> > 	method: emblcd
> > 	directory: /home/liegmann/genomezoo/emboss/prob/test/db
> > 	type: P
> > ]
> >
> >
> >
> > Best regards,
> > Oliver Liegmann


-- 
Dipl.-Inf. Oliver Liegmann

AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg

+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html

MOSS 2011 - the annual meeting on bryophyte research
http://plantco.de/MOSS2011/


From mathog at caltech.edu  Thu Feb 17 11:30:25 2011
From: mathog at caltech.edu (David Mathog)
Date: Thu, 17 Feb 2011 08:30:25 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID: <E1Pq6kL-00076N-SN@mendel.bio.caltech.edu>


> Now let's do that with the reverse complement strand:
> 
> $ transeq asis:TTTTTTTT -filter -frame 1
> >asis_1
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 2
> >asis_2
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 3
> >asis_3
> FF

That is the problem.  Let me try to explain more clearly what the issue is.


AAAAAAAA               Forward
TTTTTTTT                          Reverse
                         abc      cba  <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX    4 XFF    EXPECTED
x^a^^b^^c^   phase 2   2 KKX    5 XFF    EXPECTED
xx^a^^b^     phase 3   3 KK     6  FF    EXPECTED


^a^^b^^c^    phase 1   1 KKX    4  FF    OBSERVED
x^a^^b^^c^   phase 2   2 KKX    5 FFX    OBSERVED
xx^a^^b^     phase 3   3 KK     6 FFX    OBSERVED

Assume an extra codon L to the left of a.
                         abc      baL  <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX    4 FF     EXPLAINED?
^^a^^b^^c^   phase 2   2 KKX    5 FFX    EXPLAINED?
L^^a^^b^     phase 3   3 KK     6 FFX    EXPLAINED?


That is, if the meaning of the + phases is to define the three codons
a,b,c as shown in the diagram, such that the forward translation is as
shown, then the reverse translation should be as shown above in
expected.  That is, it is the translation of the exact same set of
codons done individually, but for the - strand reverse complement the
codon first, and then invert the resulting translated sequence.  That
way the X, where it occurs is attached to the same partial codon "c". 
What I think is happening in transeq is that it is starting with the
first full codon in the frame on the given strand.  In effect that
shifts the translated codons as shown in the "EXPLAINED?" section.

If partial codons were not translated then these would all be equivalent.
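
The two reverse-strand conventions discussed above can be compared with a small sketch. This is an illustration only and makes no claim about how transeq is implemented; the function names are hypothetical, the standard genetic code is assumed, and every partial codon is simply written as X.

```python
# Illustrative sketch only; not transeq's actual implementation.
# Standard genetic code, bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; a trailing partial codon becomes X."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_same_codons(seq, phase):
    """Reuse the forward codons of this phase: reverse-complement each
    codon, then read them in reverse order (the EXPECTED column)."""
    body = seq[phase - 1:]
    codons = [body[i:i + 3] for i in range(0, len(body), 3)]
    return "".join(CODON_TABLE.get(revcomp(codon), "X")
                   for codon in reversed(codons))


def reverse_from_strand_end(seq, phase):
    """Reverse-complement the whole sequence first, then start at the
    first full codon of the given phase on that strand."""
    return translate(revcomp(seq)[phase - 1:])


seq = "AAAAAAAA"
for phase in (1, 2, 3):
    print(phase,
          translate(seq[phase - 1:]),
          reverse_same_codons(seq, phase),
          reverse_from_strand_end(seq, phase))
```

For this 8-base example the first convention gives XFF, XFF, FF and the second gives FFX, FFX, FF. Both contain the same translations apart from where the X sits; which frame number each one is reported under is exactly the naming question, and this sketch does not settle it.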

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From biopython at maubp.freeserve.co.uk  Thu Feb 17 12:03:15 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 17:03:15 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To: <E1Pq6kL-00076N-SN@mendel.bio.caltech.edu>
References: <E1Pq6kL-00076N-SN@mendel.bio.caltech.edu>
Message-ID: <AANLkTi=t-Lc-fCuUp5n-XP-MZ8rcXeudQUxtkXDbuzSb@mail.gmail.com>

On Thu, Feb 17, 2011 at 4:30 PM, David Mathog <mathog at caltech.edu> wrote:
>
>
>> Now let's do that with the reverse complement strand:
>>
>> $ transeq asis:TTTTTTTT -filter -frame 1
>> >asis_1
>> FFX

This is what I think that does (forward frames are easy):

Frame 1, so starts at first base:
Letters 123, codon TTT, gives F
Letters 456, codon TTT, gives F
Letters 78, partial codon TT-, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 2
>> >asis_2
>> FFX

Frame 2, so starts at second base:
Letter 1, just T, ignored
Letters 234, codon TTT, gives F
Letters 567, codon TTT, gives F
Letters 8, partial codon T--, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 3
>> >asis_3
>> FF

Frame 3, so starts at third base:
Letters 12, bases TT, ignored
Letters 345, codon TTT, gives F
Letters 678, codon TTT, gives F
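
The walk-through above can be written as a short sketch that only splits the sequence into pieces for a forward frame. The function name is made up for illustration and nothing here is taken from the EMBOSS source.

```python
def split_frame(seq, frame):
    """Split seq for forward frame 1, 2 or 3 into
    (ignored leading bases, full codons, trailing partial codon)."""
    start = frame - 1                    # frame 1 starts at the first base
    body = seq[start:]
    full = len(body) // 3 * 3            # number of bases in complete codons
    codons = [body[i:i + 3] for i in range(0, full, 3)]
    return seq[:start], codons, body[full:]


for frame in (1, 2, 3):
    print(frame, split_frame("TTTTTTTT", frame))
# 1 ('', ['TTT', 'TTT'], 'TT')
# 2 ('T', ['TTT', 'TTT'], 'T')
# 3 ('TT', ['TTT', 'TTT'], '')
```

A non-empty third element is the trailing partial codon that shows up as X in the output above.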


> That is the problem.  Let me try to explain more clearly what the issue is.
>
> That is, if the meaning of the + phases is to define the three codons
> a,b,c as shown in the diagram, such that the forward translation is as
> shown, then the reverse translation should be as shown above in
> expected.  That is, it is the translation of the exact same set of
> codons done individually, but for the - strand reverse complement the
> codon first, and then invert the resulting translated sequence.  That
> way the X, where it occurs is attached to the same partial codon "c".

I couldn't understand your diagram - probably font spacing issues in part.

The EMBOSS tool is doing all six frames, so maybe all you need to work out
is the mapping between its naming and yours.

Note that it can make sense to translate a trailing partial codon, e.g.
TC... could be TCA, TCC, TCG or TCT which all code for S:

$ transeq asis:TCN -filter
>asis_1
S
$ transeq asis:TC -filter
>asis_1
S
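
The idea that a partial codon can still be translated when every completion agrees can be sketched like this. It assumes the standard genetic code and treats a missing base the same as N; the function name is hypothetical and this is not the code transeq uses.

```python
from itertools import product

# Standard genetic code, bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate_ambiguous(codon):
    """Translate a codon of up to three letters drawn from A, C, G, T, N.
    Missing trailing bases count as N. Returns the amino acid when all
    possible completions agree, otherwise X."""
    padded = codon.upper().ljust(3, "N")
    choices = [BASES if base == "N" else base for base in padded]
    outcomes = {CODON_TABLE[a + b + c] for a, b, c in product(*choices)}
    return outcomes.pop() if len(outcomes) == 1 else "X"


for codon in ("TCN", "TC", "TT", "T"):
    print(codon, translate_ambiguous(codon))
```

TC and TCN both give S because TCT, TCC, TCA and TCG all code for serine, while TT could be F or L and therefore stays X.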

Peter


From marvin.stodolsky at gmail.com  Thu Feb 17 21:23:15 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Thu, 17 Feb 2011 21:23:15 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
	<AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
	<4D5A4068.4000302@ebi.ac.uk>
	<AANLkTimVesuYCH4RnUaOZZvFwpKek8Gy_y6hWxb8Gt7w@mail.gmail.com>
	<825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
Message-ID: <AANLkTimaSiEnPX4O6rGWBqLaRoi=kA7kiGGHesEk4h50@mail.gmail.com>

Sorry,

Here is the attachment.
The whole cleanup process could be done with only sed calls, I'm sure,
but that would be beyond my sed comfort level.

MarvS

On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller <kellert at ohsu.edu> wrote:
> Hi Martin,
> I am interested in the solution. There was no attachment to the email I received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
>
>
>
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> Thanks, all, for the suggestions.  A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository
>> that is a subset of the gene files in a much larger repository.  This
>> is desirable before and after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select a minimally homologous gene set of a species.
>> RNA genes, cryptic viruses and SINE/LINE genes are among
>> the undesirables to be eliminated.
>>
>> Specifically, is there a command, using entret or relatives, that accepts a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job.
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>
>>>> This is elementary, I'm sure, but I've been unable to work out the
>>>> syntax from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the FASTA headers from a sequence
>>>> repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>>>> come out as a uniform field/fields in the resulting spreadsheet.  Is there a fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>>  -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>>  -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
>
-------------- next part --------------


With respect to using info in the FASTA description field, the intent and partial solution can now be explained.
The top-level intent is to avoid overlapping genes in a statistical analysis that is being planned.
The 3rd and 4th lines below are from an "infoseq -nocolumns" whole-genome retrieval. They report an overlap, i.e.,
DNA gyrase subunit A is overlapped by seryl-tRNA synthetase: seryl_begin (7294) - gyrase_end (7322) < 0

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]

In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the protein/codon-usage statistical analysis being planned. Thus I wanted a way of recognizing the overlapping genes en masse.
An inelegant fix has been worked out.

After pulling the dataset into a spreadsheet, spaces in the description field were replaced with >< :
DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37]
thymidylate><kinase><8551..9183(+)><[Mycoplasma><genitalium><G37]

Next ><[ is replaced by the "to be" field separator |[
DNA><polymerase><III,><beta><subunit><686..1828(+)|[Mycoplasma><genitalium><G37]
DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37]
and the file saved as:   Myc637000176m.csv 

To get rid of >< in the common trailing [Mycoplasma><genitalium><G37], this was run:
$  cut -d"[" -f1 Myc637000176m.csv > Myc637000176m2.csv
resulting in :
DNA><polymerase><III,><beta><subunit><686..1828(+)|
DnaJ><domain><protein><1828..2760(+)|
DNA><gyrase><subunit><B><2845..4797(+)|
DNA><gyrase><subunit><A><4812..7322(+)|
seryl-tRNA><synthetase><7294..8547(+)|

The internal fields are next mostly deleted with:
sed -e 's/<.*>//g'  Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

The single remaining >< is replaced with the potential separator |
sed -e 's/></|/g'  Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|
BASICALLY, the clever work is now done, and the rest is more routine manipulation.

A cleanup was done with:
sed -e 's/)|//g'  Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g'  Myc637000176m5.csv > Myc637000176m6.csv
which together change  (+)|  to  |+  , that is, a separate field.

Replacing the residual  ..  with the potential separator | was most easily done as an operation within the spreadsheet, in its own field, because there are too many other "." characters in the whole file.
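
The chain of substitutions above could also be collapsed into one pattern match per description line. The following is a minimal sketch that assumes every description has the shape "product start..end(strand) [organism]" seen in the five example lines; the names are hypothetical, and lines that do not fit the pattern are returned as None rather than guessed at.

```python
import re

# Assumed layout: "<product> <start>..<end>(<strand>) [<organism>]"
DESCRIPTION = re.compile(
    r"^(?P<product>.+?)\s+(?P<start>\d+)\.\.(?P<end>\d+)"
    r"\((?P<strand>[+-])\)\s+\[(?P<organism>[^\]]+)\]\s*$"
)


def parse_description(line):
    """Return (product, start, end, strand, organism) or None."""
    match = DESCRIPTION.match(line.strip())
    if match is None:
        return None
    return (match["product"], int(match["start"]), int(match["end"]),
            match["strand"], match["organism"])


lines = [
    "DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]",
    "DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]",
    "DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]",
    "seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]",
    "thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]",
]
for line in lines:
    fields = parse_description(line)
    # Tab-separated output imports cleanly into a spreadsheet.
    print("\t".join(str(field) for field in fields))
```

Keeping the full product name avoids the loss of information seen above, where several genes end up labelled only "DNA".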

After routine manipulations within the spreadsheet,
a view of the overlap detection section is:
F       G       H               I                   J                                   (fields)
Start   End     nextBegin-End   OR((H2<0),(H1<0))   Stable 0/1 value, for sorting on
686     1828    0               FALSE               0
1828    2760    85              FALSE               0
2845    4797    15              TRUE                0
4812    7322    -28             TRUE                1
7294    8547    4               TRUE                1
8551    9183    -27             TRUE                1
9156    9920    3               FALSE               0
9923    11251   0               FALSE               0

The overlapping genes keep a stable value of 1 during sorting, whereas the FALSE/TRUE values in field I are not stable during sorting.
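
The overlap test itself can be sketched outside the spreadsheet. The version below is a minimal illustration with hypothetical names; it assumes the genes are already sorted by start coordinate, compares only neighbouring genes, and uses the same strict "< 0" test as the spreadsheet formula.

```python
def flag_overlaps(genes):
    """genes: list of (start, end) tuples sorted by start.
    Returns a list of 0/1 flags, 1 for every gene that overlaps
    its neighbour (next start minus this end is negative)."""
    flags = [0] * len(genes)
    for i in range(len(genes) - 1):
        gap = genes[i + 1][0] - genes[i][1]
        if gap < 0:
            flags[i] = 1        # this gene runs into the next one
            flags[i + 1] = 1    # and the next one is overlapped
    return flags


# Coordinates from the five example description lines.
genes = [(1828, 2760), (2845, 4797), (4812, 7322), (7294, 8547), (8551, 9183)]
print(flag_overlaps(genes))   # [0, 0, 1, 1, 0]
```

Two caveats apply to this sketch: a gap of 0 (for example 686..1828 followed by 1828..2760) is not flagged even though the two genes share one base under inclusive coordinates, and a long gene that spans more than its immediate neighbour would need a comparison against the largest end seen so far.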
		

From egorleg at gmail.com  Tue Feb 22 19:56:40 2011
From: egorleg at gmail.com (Kevin Egan)
Date: Wed, 23 Feb 2011 00:56:40 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To: <AANLkTi=Ua1uJk4Pb=oZOc-s+UUArAQVkBEKLABOcGSHJ@mail.gmail.com>
References: <mailman.9199.1298420178.2958.emboss@lists.open-bio.org>
	<AANLkTi=Ua1uJk4Pb=oZOc-s+UUArAQVkBEKLABOcGSHJ@mail.gmail.com>
Message-ID: <AANLkTi=Uggy3csMtVLqAMNZx0zk1WgD=42vyy36rPML7@mail.gmail.com>

Hi

I was wondering is there anywhere I could find the source code for
dot-matcher?

From uludag at ebi.ac.uk  Wed Feb 23 04:50:28 2011
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Wed, 23 Feb 2011 09:50:28 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To: <AANLkTi=Uggy3csMtVLqAMNZx0zk1WgD=42vyy36rPML7@mail.gmail.com>
References: <mailman.9199.1298420178.2958.emboss@lists.open-bio.org>
	<AANLkTi=Ua1uJk4Pb=oZOc-s+UUArAQVkBEKLABOcGSHJ@mail.gmail.com>
	<AANLkTi=Uggy3csMtVLqAMNZx0zk1WgD=42vyy36rPML7@mail.gmail.com>
Message-ID: <1298454628.8626.3.camel@emboss1.ebi.ac.uk>

Hi Kevin,

> I was wondering is there anywhere I could find the source code for
> dot-matcher?

EMBOSS release tarballs include source files for EMBOSS applications and
for EMBOSS libraries.

    ftp://emboss.open-bio.org/pub/EMBOSS/

Regards,
Mahmut


From jmeador45 at mac.com  Tue Feb  1 03:43:50 2011
From: jmeador45 at mac.com (Jim Meador)
Date: Mon, 31 Jan 2011 22:43:50 -0500
Subject: [EMBOSS] can't install emboss 6.3.1 in mac os x 10.6
In-Reply-To: <C96C47E2.16F2%idrummon@receptor.mgh.harvard.edu>
References: <C96C47E2.16F2%idrummon@receptor.mgh.harvard.edu>
Message-ID: <2D54C5BB-C798-43C5-8EBA-2CCC9DEEC740@mac.com>

Hi Iain,
I think you have the right idea. I had installed XQuartz which actually fixed some other issues I was having (but created new ones ;-) and I think I need to re-configure with x11=/opt/x11 or something like that. 
So I think you have the answer and thank you for the help.
Sincerely,
Jim (just around the corner in Cambridge)


On Jan 31, 2011, at 11:05 AM, Iain Drummond wrote:

> I seem to remember that I could bypass this issue by configuring EMBOSS to
> install without x11. I realize that's not what you want to do. I think the
> problem stems from the directory structure Apple uses for x11; i.e. it's not
> where EMBOSS thinks it is. Maybe Apple put x11 in a new place with 10.6?
> 
> Iain Drummond
> 
> 
> On 1/30/11 11:49 PM, "Jim Meador" <jmeador45 at mac.com> wrote:
> 
>> Hi Everyone,
>> 
>> I want to upgrade to EMBOSS 6.3.1 on Mac OS X 10.6.6 (MacBook Pro 2.2 GHz,
>> 4GB) from EMBOSS 6.2.0, which was installed under Leopard (10.5) before I upgraded
>> to Snow Leopard (10.6). The make process does not seem to run correctly:
>> in the plplot directory, the .libs directory does not get created, so
>> when I do sudo make install, it fails with these error messages:
>> 
>> Making install in plplot
>> make[1]: Entering directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot'
>> Making install in lib
>> make[2]: Entering directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot/lib'
>> make[3]: Entering directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot/lib'
>> make[3]: Nothing to be done for `install-exec-am'.
>> test -z "/usr/local/share/EMBOSS" || ../.././install-sh -c -d "/usr/local/share/EMBOSS"
>> /usr/bin/install -c -m 644 plstnd5.fnt plxtnd5.fnt '/usr/local/share/EMBOSS'
>> make[3]: Leaving directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot/lib'
>> make[2]: Leaving directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot/lib'
>> make[2]: Entering directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot'
>> make[3]: Entering directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot'
>> test -z "/usr/local/lib" || .././install-sh -c -d "/usr/local/lib"
>> /bin/sh ../libtool   --mode=install /usr/bin/install -c   libeplplot.la '/usr/local/lib'
>> libtool: install: /usr/bin/install -c .libs/libeplplot.3.dylib /usr/local/lib/libeplplot.3.dylib
>> install: .libs/libeplplot.3.dylib: No such file or directory
>> make[3]: *** [install-libLTLIBRARIES] Error 71
>> make[3]: Leaving directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot'
>> make[2]: *** [install-am] Error 2
>> make[2]: Leaving directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot'
>> make[1]: *** [install-recursive] Error 1
>> make[1]: Leaving directory `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/plplot'
>> make: *** [install-recursive] Error 1
>> 
>> When I look in the source directory, /EMBOSS-6.3.1/plplot/ there is no .libs
>> directory as there is in another installation on a newer MacBook Pro that I
>> was able to successfully install this same software (10.6.6, 2.6 GHz, 8GB).
>> 
>> My setup is a little more complicated than I would like, since I have
>> installed the eBiotools-3.0.1-leopard software which sets up an older version
>> of emboss 5.0.0 within a /usr/ebiotools/ directory and uses a very nice gui
>> program to access these older emboss programs, called "eBioX" (and I don't
>> want to lose this). So I have been installing the newer 6.x version of emboss
>> to use at the commandline and with kemboss, both of which work, sort of. I
>> have to play with the environment variables to get the text output programs to
>> work, but I cannot get the graphics to work from various emboss programs that
>> try to make graphs, such as "charge". The text-based programs work but I want
>> to get the graphics working as well as it does from the ebiotools versions and
>> will probably need to re-make emboss 6.3.1 after (re)installing gd and libpng.
>> However, on this mac, neither ps nor x11 work. On the newer mac, with 6.3.1
>> installed, I don't have png or gif support, but I can at least get
>> graphs in ps and x11 to work.
>> 
>> Does anyone have any ideas of what I may be doing wrong? Is it possible that
>> some environment variable could be causing this?
>> 
>> Any ideas will be greatly appreciated.
>> Thanks,
>> Jim
>> 
> 
> 
> 
> 


From oliver.liegmann at biologie.uni-freiburg.de  Fri Feb  4 07:53:38 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Fri, 04 Feb 2011 08:53:38 +0100
Subject: [EMBOSS] seqret does not find sequence after update
Message-ID: <1296806018.12454.29.camel@yoda>

Dear list members,

has any of you also run into this problem (and perhaps has an idea of what is going wrong)?

After upgrading from version 6.2.0 to 6.3.1 seqret does not work
properly anymore:
First, Emboss was installed using
./configure --enable-64 --prefix=/opt/emboss 
make
make install

The database was set up with:
dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas

Using seqret to retrieve the sequences produces an error:
seqret plafa_test:PLAFA_MAL13P1.23-b
Reads and writes (returns) sequences
output sequence(s) [plafa_mal13p1.fasta]:
Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a'

Only the remaining two sequences are stored in the output file.

Have the characters allowed in accessions changed? With Emboss
6.2.0 we did not have any problems, but after the upgrade a large number of
sequences could no longer be retrieved from our internal fasta database, although the output in
outfile.dbifasta shows all sequences as inserted into the database.


The contents of the different files are:
PLAFA_test.fas:
>PLAFA_MAL13P1.23-b
MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
>PLAFA_MAL13P1.237a
MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
>PLAFA_MAL13P1.23-a
MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH


test.txt:
plafa:PLAFA_MAL13P1.23-b
plafa:PLAFA_MAL13P1.237a
plafa:PLAFA_MAL13P1.23-a


emboss.default:
DB plafa [
	format: fasta
	method: emblcd
	directory: /home/liegmann/genomezoo/emboss/prob/test/db
	type: P
]
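
One quick check that is independent of the index is to confirm that every entry in the list file really has a matching header in the FASTA file. The sketch below does only that; the function names are hypothetical, it compares IDs exactly as written, and it says nothing about how dbifasta or seqret parse IDs.

```python
def fasta_ids(fasta_text):
    """Return the first word of every FASTA header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def missing_from_fasta(list_text, fasta_text):
    """Return list-file entries (db:id) whose id has no FASTA header."""
    known = set(fasta_ids(fasta_text))
    missing = []
    for entry in list_text.split():
        seq_id = entry.split(":", 1)[-1]   # drop the "plafa:" prefix
        if seq_id not in known:
            missing.append(entry)
    return missing


FASTA_TEXT = """>PLAFA_MAL13P1.23-b
MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
>PLAFA_MAL13P1.237a
MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
>PLAFA_MAL13P1.23-a
MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
"""
LIST_TEXT = """plafa:PLAFA_MAL13P1.23-b
plafa:PLAFA_MAL13P1.237a
plafa:PLAFA_MAL13P1.23-a
"""
print(missing_from_fasta(LIST_TEXT, FASTA_TEXT))   # []
```

An empty result for the files shown here means all three IDs are present in the FASTA file, which would point at the indexing or lookup step rather than at the input data.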


Best regards,
Oliver Liegmann

-- 
Dipl.-Inf. Oliver Liegmann

AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg

+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html


-------------- next part --------------
A non-text attachment was scrubbed...
Name: outfile.dbifasta
Type: application/octet-stream
Size: 848 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20110204/3abfcd4c/attachment-0002.obj>

From Caroline.Barretto at rdls.nestle.com  Tue Feb  8 10:01:28 2011
From: Caroline.Barretto at rdls.nestle.com (Barretto, Caroline, LAUSANNE,
	BioInformatics)
Date: Tue, 8 Feb 2011 11:01:28 +0100
Subject: [EMBOSS] diffseq memory problem?
Message-ID: <BDDDF6E121C51A4080F4B7E5605E2FBB347C88@HQVEVE0014.nestle.com>

Dear EMBOSS developers,

 
I have been using diffseq to compare two strains of the same bacterial
species using "10" as wordsize without any problem. 

However, when I try to reduce this number to "4", after several hours of
calculation the server collapses, all RAM and SWAP are used.

Is there any option to avoid that, or do you know if someone is working
on that problem?

 
Many thanks,

 
Best regards,

 
Caroline.

 
From pmr at ebi.ac.uk  Tue Feb  8 10:46:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 08 Feb 2011 10:46:32 +0000
Subject: [EMBOSS] diffseq memory problem?
In-Reply-To: <BDDDF6E121C51A4080F4B7E5605E2FBB347C88@HQVEVE0014.nestle.com>
References: <BDDDF6E121C51A4080F4B7E5605E2FBB347C88@HQVEVE0014.nestle.com>
Message-ID: <4D511F08.7010206@ebi.ac.uk>

Dear Caroline,

On 08/02/2011 10:01, Barretto, Caroline, LAUSANNE, BioInformatics wrote:
> Dear EMBOSS developers,
>
> I have been using diffseq to compare two strains of the same bacterial
> species using "10" as wordsize without any problem.
>
> However, when I try to reduce this number to "4", after several hours of
> calculation the server collapses, all RAM and SWAP are used.
>
> Is there any option to avoid that, or do you know if someone is working
> on that problem?

Depending on the input size, and the number of simple repeats, a low 
word size could easily generate too many matches for large sequence lengths.

We would recommend reducing the word size more slowly (maybe 10, 8, 6).

As a guideline, finding more matches than there are non-overlapping 
words in the sequence is unlikely to be useful and is a reasonable point 
to stop reducing the word size.

Meanwhile, we will take a look at diffseq in case there is some way to 
improve its performance or to warn at an early stage if the word size 
appears small for the input sequence lengths and may generate too many 
matches.
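
To give a feel for why a small word size is so costly, here is a back-of-the-envelope sketch. It assumes uniformly random, independent bases, which real genomes are not, and it is not how diffseq counts matches; the 2,000,000 bp genome length is a made-up figure for illustration.

```python
def expected_random_word_matches(len_a, len_b, wordsize, alphabet=4):
    """Expected number of exact word matches arising by chance between
    two random sequences: every pair of word positions matches with
    probability 1 / alphabet**wordsize."""
    positions_a = max(len_a - wordsize + 1, 0)
    positions_b = max(len_b - wordsize + 1, 0)
    return positions_a * positions_b / alphabet ** wordsize


def nonoverlapping_words(length, wordsize):
    """Number of non-overlapping words in a sequence (the rule of thumb
    above uses this as the point to stop reducing the word size)."""
    return length // wordsize


genome = 2_000_000   # hypothetical genome length in bp
for wordsize in (10, 8, 6, 4):
    chance = expected_random_word_matches(genome, genome, wordsize)
    print(wordsize, f"{chance:.3g}", nonoverlapping_words(genome, wordsize))
```

Under this model each step down in word size multiplies the chance matches by about 4, so going from 10 to 4 multiplies them by roughly 4096. Simple repeats make the real numbers worse.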

Hope this helps

Peter Rice
EMBOSS Team


From WulfDirk.Leuschner at sanofi-aventis.com  Thu Feb 10 07:43:17 2011
From: WulfDirk.Leuschner at sanofi-aventis.com (WulfDirk.Leuschner at sanofi-aventis.com)
Date: Thu, 10 Feb 2011 08:43:17 +0100
Subject: [EMBOSS] lit. references for EMBOSS data files,
	e.g. Epk.dat (iep usage)
Message-ID: <650F10565E484347B51CF6679663A80402F51F19@ffpw10.f2.enterprise>

Hi all, 
I was wondering whether someone might know something about how some of
the meta data used in EMBOSS were compiled. A colleague of mine was
looking for a reference for the Epk.dat values used for the
determination of the isoelectric point of a protein. However, neither
she nor I could find anything... Any hints?
Wulf Dirk Leuschner 


From jison at ebi.ac.uk  Thu Feb 10 16:42:34 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 10 Feb 2011 16:42:34 -0000 (UTC)
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D41F992.5030900@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu>
Message-ID: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>

Hi Lionel

Didn't see a reply to you, sorry.

Anyhow, dreg will search the sequence as given.  This is taken as the sense/coding/+ strand.

If you specify -sreverse (which is available to any applications that read sequences) it will I
think search the reverse complement of that sequence instead.

Cheers

Jon


> Hello fellow EMBOSS fans,
>
> I am using the dreg program to search the human genome for my favorite
> motif.  I was unable to find any information regarding the meaning of
> the strand information in the output.  Does dreg search both strands or
> will it always return "+" as the strand designation of the hits that it
> finds?
>
> Thanks for your continued support and development of this fantastic tool!
>
> Sincerely,
> Lionel "Lee" Brooks 3rd
> Dartmouth Genetics Grad Student
>


From sigve.nakken at medisin.uio.no  Fri Feb 11 10:54:44 2011
From: sigve.nakken at medisin.uio.no (Sigve Nakken)
Date: Fri, 11 Feb 2011 11:54:44 +0100
Subject: [EMBOSS] DNA sequence as input argument
Message-ID: <4D551574.2080004@medisin.uio.no>

Hi,

Is there any way in which one can read a DNA sequence directly from the
command line (that is, as a string input argument) rather than from a file?
I am especially interested in finding repeats, inverted repeats etc.
(e.g. the 'einverted' and 'etandem' EMBOSS apps). Instead of creating a FASTA
file for each query sequence, I would like to read the sequence directly from
the command line. Is this possible?

Kind regards,
Sigve


From pmr at ebi.ac.uk  Fri Feb 11 11:11:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 11 Feb 2011 11:11:13 +0000
Subject: [EMBOSS] DNA sequence as input argument
In-Reply-To: <4D551574.2080004@medisin.uio.no>
References: <4D551574.2080004@medisin.uio.no>
Message-ID: <4D551951.7030406@ebi.ac.uk>

Dear Sigve,

On 11/02/2011 10:54, Sigve Nakken wrote:
> Hi,
>
> Is there any way in which one can read a DNA sequence directly from the
> command line (that is
> as a string input argument) rather than from a file? I am especially
> interested in finding repeats,
> inverted repeats etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead
> of creating a FASTA file
> for each query sequence, I would like to read the sequence directly from
> the command line. Is this possible?

seqret asis::ctgatcgatgctagctgac

the "asis" format was included exactly for this purpose. You do need to 
take care that a long sequence is not too long for your shell to handle 
on the command line (a shell issue, not an EMBOSS issue).

You can also add to the command line:

-sid abc123

This will give it an ID of abc123 and the output file will default to 
(for seqret) abc123.fasta and will have the abc123 identifier in it.
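
When the sequences come from a script rather than from typing, the command can be assembled as an argument list, which avoids shell quoting altogether. The sketch below only builds the list and makes a rough length check; the helper names are hypothetical, the only EMBOSS syntax used is the asis:: address and -sid shown above, and the length check is a POSIX-only approximation that ignores the space taken by the environment.

```python
import os


def asis_command(program, sequence, seq_id=None):
    """Build an argv list such as ['seqret', 'asis::ACGT', '-sid', 'abc123']."""
    argv = [program, "asis::" + sequence]
    if seq_id is not None:
        argv += ["-sid", seq_id]
    return argv


def fits_on_command_line(argv, safety_margin=4096):
    """Rough check against the system limit on total argument length.
    Returns True when the limit cannot be determined."""
    try:
        limit = os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return True
    needed = sum(len(arg) + 1 for arg in argv)
    return limit < 0 or needed + safety_margin < limit


argv = asis_command("seqret", "ctgatcgatgctagctgac", "abc123")
print(argv, fits_on_command_line(argv))
# The list could then be handed to subprocess.run(argv) on a machine
# where EMBOSS is installed.
```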

Hope this helps

Peter Rice
EMBOSS Team


From stephen.taylor at imm.ox.ac.uk  Fri Feb 11 11:13:45 2011
From: stephen.taylor at imm.ox.ac.uk (Steve Taylor)
Date: Fri, 11 Feb 2011 11:13:45 +0000
Subject: [EMBOSS] DNA sequence as input argument
In-Reply-To: <4D551574.2080004@medisin.uio.no>
References: <4D551574.2080004@medisin.uio.no>
Message-ID: <4D5519E9.1050401@imm.ox.ac.uk>

Hi,

>
> Is there any way in which one can read a DNA sequence directly from the
> command line (that is
> as a string input argument) rather than from a file? I am especially
> interested in finding repeats,
> inverted repeats etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead
> of creating a FASTA file
> for each query sequence, I would like to read the sequence directly from
> the command line. Is this possible?
>

 From http://emboss.sourceforge.net/docs/faq.html


A) The "filename" is really the sequence. This is a quick and easy way of reading in a short fragment of sequence without having to enter it into a file.

For example:

    % program -seq asis::ATGGTGAGGAGAGTTGTGATGAGA


Steve


From Lionel.Brooks at dartmouth.edu  Fri Feb 11 18:22:38 2011
From: Lionel.Brooks at dartmouth.edu (Lionel (Lee) Brooks 3rd)
Date: Fri, 11 Feb 2011 13:22:38 -0500
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
References: <4D41F992.5030900@dartmouth.edu>
	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
Message-ID: <4D557E6E.3020608@dartmouth.edu>

Hi Jon,

Thank you!  Apparently, I should have rtfm more than once.

Sincerely,
Lionel

Jon Ison wrote:
> Hi Lionel
>
> Didn't see a reply to you, sorry.
>
> Anyhow, dreg will search the sequence as given.  This is taken as the sense/coding/+ strand.
>
> If you specify -sreverse (which is available to any applications that read sequences) it will I
> think search the reverse complement of that sequence instead.
>
> Cheers
>
> Jon
>
>
>
>   
>> Hello fellow EMBOSS fans,
>>
>> I am using the dreg program to search the human genome for my favorite
>> motif.  I was unable to find any information regarding the meaning of
>> the strand information in the output.  Does dreg search both strands or
>> will it always return "+" as the strand designation of the hits that it
>> finds?
>>
>> Thanks for your continued support and development of this fantastic tool!
>>
>> Sincerely,
>> Lionel "Lee" Brooks 3rd
>> Dartmouth Genetics Grad Student


From pmr at ebi.ac.uk  Fri Feb 11 18:46:22 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 11 Feb 2011 18:46:22 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D557E6E.3020608@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu>	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
	<4D557E6E.3020608@dartmouth.edu>
Message-ID: <4D5583FE.1060703@ebi.ac.uk>

Dear Lee,

On 11/02/2011 18:22, Lionel (Lee) Brooks 3rd wrote:
> Hi Jon,
>
> Thank you! Apparently, I should have rtfm more than once.

True ... but it is not obvious which part to (re-)read. We could make it 
easier.

Perhaps the "Input sequence" could point to the sequence qualifiers and 
the USA syntax. We will look at improving this part of the documentation 
in the next release.

regards,

Peter Rice
EMBOSS Team


From db60 at st-andrews.ac.uk  Sat Feb 12 12:07:03 2011
From: db60 at st-andrews.ac.uk (Daniel Barker)
Date: Sat, 12 Feb 2011 12:07:03 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D5583FE.1060703@ebi.ac.uk>
References: <4D41F992.5030900@dartmouth.edu>	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>	<4D557E6E.3020608@dartmouth.edu>
	<4D5583FE.1060703@ebi.ac.uk>
Message-ID: <4D5677E7.1080402@st-andrews.ac.uk>

Dear Peter,

A lot of the time for nucleotide stuff it makes sense to search both 
strands. Of course, it isn't hard to search one strand, then the other. 
But this introduces an extra step. I wonder if there could be some 
convenient option to do this, and if it should perhaps be the default? 
(As with NCBI blastall with any kind of nucleotide search.)

This would affect programs beyond just dreg and, though it would be OK 
for our work, perhaps it wouldn't make sense for others. Just a thought.

Best regards,

Daniel

-- 
Daniel Barker
http://bio.st-andrews.ac.uk/staff/db60.htm
The University of St Andrews is a charity registered in Scotland : No 
SC013532


From pmr at ebi.ac.uk  Sat Feb 12 16:58:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 12 Feb 2011 16:58:32 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D5677E7.1080402@st-andrews.ac.uk>
References: <4D41F992.5030900@dartmouth.edu>	<46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>	<4D557E6E.3020608@dartmouth.edu>	<4D5583FE.1060703@ebi.ac.uk>
	<4D5677E7.1080402@st-andrews.ac.uk>
Message-ID: <4D56BC38.9070408@ebi.ac.uk>

Dear Daniel,

On 12/02/2011 12:07, Daniel Barker wrote:
> A lot of the time for nucleotide stuff it makes sense to search both
> strands. Of course, it isn't hard to search one strand, then the other.
> But this introduces an extra step. I wonder if there could be some
> convenient option to do this, and if it should perhaps be the default?
> (As with NCBI blastall with any kind of nucleotide search.)
>
> This would affect programs beyond just dreg and, though it would be OK
> for our work, perhaps it wouldn't make sense for others. Just a thought.

Interesting suggestion.

Maybe we can add a -bothstrands option for applications to search the 
forward and reverse strands. We need to consider:

* Do the results make sense?
* What default do we set (maybe some programs have a different default)?
* Is this complicated for programs that can use DNA or protein input?
* Can we apply it to applications aligning two sequences?

Meanwhile, running the program twice, with -sreverse the second time, 
will find all the matches.
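
Until such an option exists, the two-pass search can also be emulated outside EMBOSS. What follows is only a minimal Python sketch, not dreg itself: `revcomp` and `find_both_strands` are hypothetical helper names, the pattern is an ordinary Python regular expression, and hits on the reverse strand are reported in forward-strand coordinates (1-based, inclusive).

```python
import re

# Complement table for unambiguous bases; other characters pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(seq, pattern):
    """Search the forward strand, then the reverse complement.

    Returns (start, end, strand) tuples in forward-strand coordinates.
    re.finditer reports non-overlapping matches only.
    """
    hits = [(m.start() + 1, m.end(), "+") for m in re.finditer(pattern, seq)]
    n = len(seq)
    for m in re.finditer(pattern, revcomp(seq)):
        # Index i on the reverse complement is index n - 1 - i on the forward strand.
        hits.append((n - m.end() + 1, n - m.start(), "-"))
    return hits


print(find_both_strands("ACGTTTGGG", "CCC"))
```

The second loop plays the role of the second run with -sreverse; the only extra work is converting the coordinates back to the forward strand.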

regards,

Peter Rice
EMBOSS Team


From david.bauer at bayer.com  Mon Feb 14 07:36:46 2011
From: david.bauer at bayer.com (david.bauer at bayer.com)
Date: Mon, 14 Feb 2011 08:36:46 +0100
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D56BC38.9070408@ebi.ac.uk>
Message-ID: <OFB8A628DC.F6C6BC09-ONC1257837.00278F8D-C1257837.0029D1D6@bayer.de>

Hi Daniel & Peter,

emboss-bounces at lists.open-bio.org wrote on 12/02/2011 17:58:32:

> Dear Daniel,
> 
> On 12/02/2011 12:07, Daniel Barker wrote:
> > A lot of the time for nucleotide stuff it makes sense to search both
> > strands. Of course, it isn't hard to search one strand, then the other.
> > But this introduces an extra step. I wonder if there could be some
> > convenient option to do this, and if it should perhaps be the default?
> > (As with NCBI blastall with any kind of nucleotide search.)
> >
> > This would affect programs beyond just dreg and, though it would be OK
> > for our work, perhaps it wouldn't make sense for others. Just a thought.
> 

I think another candidate would be fuzznuc. 
(At least, that is the program where I have sometimes missed this option ;-)

> Interesting suggestion.
> 
> Maybe we can add a -bothstrands option for applications to search the 
> forward and reverse strands.

Yes, this would add the new functionality without breaking the old default 
behaviour of the programs.

> We need to consider:
> * Do the results make sense?
> * What default do we set (maybe some programs have a different default)?

As mentioned above, I would not touch the old default settings and add 
searching both strands as an option.
(e.g. in stssearch the search of both strands is already the default.)

> * Is this complicated for programs that can use DNA or protein input?
> * Can we apply it to applications aligning two sequences?

I think it could make sense for programs which can align one sequence 
against a set of other sequences (e.g. water, needle). 

Regards,
David.


From marvin.stodolsky at gmail.com  Mon Feb 14 23:35:45 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Mon, 14 Feb 2011 18:35:45 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
Message-ID: <AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>

 This is elementary I'm sure, but I've been unable to work out the
syntax  from the documentation.
More minor issue.

When using infoseq to extract all the fasta Headers from a sequence
Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
for this?

MarvS


From pmr at ebi.ac.uk  Tue Feb 15 08:59:20 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 15 Feb 2011 08:59:20 +0000
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
	<AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
Message-ID: <4D5A4068.4000302@ebi.ac.uk>

On 14/02/2011 23:35, Marvin Stodolsky wrote:
>   This is elementary I'm sure, but I've been unable to work out the
> syntax  from the documentation.
> More minor issue.
>
> When using infoseq to extract all the fasta Headers from a sequence
> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
> for this?

I don't see the genebegin and geneend in EMBOSS infoseq output. Are they 
part of the sequence ID in the FASTA file?

You can use a delimiter between items for infoseq using:

  -nocolumn

on the command line.

For import into a spreadsheet you can set the delimiter to be tab with:

  -nocolumn -delimiter "\t"

on the command line. That should then import nicely into a spreadsheet.

Hope that helps

Peter Rice
EMBOSS Team


From mathog at caltech.edu  Wed Feb 16 20:54:05 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 12:54:05 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID: <E1PpoNx-0006kR-B8@mendel.bio.caltech.edu>

Test case fasta file
>8Achars
AAAAAAAA

all 6 frames for transeq, standard mode emits:
>_1
KKX
>_2
KKX
>_3
KK
>_4
FF
>_5
FFX
>_6
FFX

But...

AAAAAAAA               Forward
TTTTTTTT                          Reverse
                         abc      cba  <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX    4 XFF
x^a^^b^^c^   phase 2   2 KKX    5 XFF
xx^a^^b^     phase 3   3 KK     6  FF

That is, frames 4->6 are supposed to use, respectively, the same
set of codons as 1->3, but translated on the opposite strand. Shouldn't
the number of residues returned then be as in the table above, with
the X at the beginning rather than the end? Or, to put it another way,
shouldn't the little "x" bases be ignored on the - strand if they were
also ignored on the +?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mathog at caltech.edu  Wed Feb 16 22:47:15 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 14:47:15 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID: <E1Ppq9T-0006pO-7W@mendel.bio.caltech.edu>

Here is another worked example with a small but real mRNA fragment.
(Best cut and paste it into a program with a fixed width font).

Test sequence:

>for  (AKA gi|1728|emb|V00893.1, this is "+" direction)
TCGAAAACCGGGCCATGAAGGATGAGGAGAAGATGGAGCTGCA
GGAGATGCAGCTGAAGGAGGCCAAGCACATTGCCGAGGACTCA
GACCGCAAATACGAGGAGGTGGCCAGGAAGCTGGTGATCCTCGA

>rev (for reversed)
TCGAGGATCACCAGCTTCCTGGCCACCTCCTCGTATTTGCGGT
CTGAGTCCTCGGCAATGTGCTTGGCCTCCTTCAGCTGCATCTC
CTGCAGCTCCATCTTCTCCTCATCCTTCATGGCCCGGTTTTCGA

Transeq output, all 6 frames, for >for and >rev
>for_1
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>for_2
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>for_3
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>for_4
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>for_5
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>for_6
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_1
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>rev_2
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>rev_3
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_4
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>rev_5
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>rev_6
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX

Output from a different program, all 12 frame options
shown on the fasta header line as: 

  phase(strand)

Positive phases are measured from sequence position 1. 
Negative phases measured from sequence position
N, the last base in the sequence. 
This program differs from transeq in that any
partial codon is emitted as an X.  Note how
transeq output never starts with an X, whereas
here the X maintains its position on the
Nucleic acid sequence, for instance, +1(+) and +1(-).

>gi|1728|emb|V00893.1|[+1(+)] 
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>gi|1728|emb|V00893.1|[+2(+)] 
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[+3(+)] 
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>gi|1728|emb|V00893.1|[+1(-)] 
XRGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[+2(-)] 
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFS
>gi|1728|emb|V00893.1|[+3(-)] 
XEDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVF
>gi|1728|emb|V00893.1|[-1(-)] 
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>gi|1728|emb|V00893.1|[-2(-)] 
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[-3(-)] 
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>gi|1728|emb|V00893.1|[-1(+)] 
XRKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[-2(+)] 
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SS
>gi|1728|emb|V00893.1|[-3(+)] 
XENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVIL
>gi|1728|emb|V00893.1| 

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From marvin.stodolsky at gmail.com  Thu Feb 17 02:07:51 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Wed, 16 Feb 2011 21:07:51 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <4D5A4068.4000302@ebi.ac.uk>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
	<AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
	<4D5A4068.4000302@ebi.ac.uk>
Message-ID: <AANLkTimVesuYCH4RnUaOZZvFwpKek8Gy_y6hWxb8Gt7w@mail.gmail.com>

Thanks to all for the suggestions.  A solution to the GeneBegin..GeneEnd
problem has been worked out, per the attachment, for those interested.

But for me the more important problem is making a FASTA repository
that is a subset of the gene files in a much larger repository.  This
is desirable before and after using Usearch -
http://www.drive5.com/usearch/intro.html
to select a minimally homologous gene set of a species.
RNA genes, cryptic viruses, and SINE/LINE genes are among
the undesirables to be eliminated.

Specifically, is there a command, using entret or its relatives, that accepts a list like
637008924
637008927
640691430
640691431
637008928
637008954
637008980
for extraction and repacking into a single smaller Repository?

If not, could you recommend a software tool/suite for this type of job?
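
As a tool-independent illustration of the task, here is a minimal Python sketch that copies only the wanted records from a FASTA stream. It assumes (this is not stated above) that each numeric identifier is the first whitespace-delimited token on the header line; `subset_fasta` is a hypothetical name, not an EMBOSS command.

```python
def subset_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids.

    The ID is taken to be the first token after '>' (an assumption
    about the repository's header layout).
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted_ids
        if keep:
            yield line


if __name__ == "__main__":
    wanted = {"637008924", "637008927"}
    example = [
        ">637008924 first gene\n", "ACGT\n",
        ">111 unwanted gene\n", "GGGG\n",
        ">637008927 second gene\n", "TTTT\n",
    ]
    print("".join(subset_fasta(example, wanted)), end="")
```

With real files, the same generator could be fed an open file handle and its output written to the new, smaller repository; the ID list would be read into a set, one ID per line.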

MarvS

On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>
>>  This is elementary I'm sure, but I've been unable to work out the
>> syntax  from the documentation.
>> More minor issue.
>>
>> When using infoseq to extract all the fasta Headers from a sequence
>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>> for this?
>
> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
> part of the sequence ID in the FASTA file?
>
> You can use a delimiter between items for infoseq using:
>
>   -nocolumn
>
> on the command line.
>
> For import into a spreadsheet you can set the delimiter to be tab with:
>
>   -nocolumn -delimiter "\t"
>
> on the command line. That should then import nicely into a spreadsheet.
>
> Hope that helps
>
> Peter Rice
> EMBOSS Team
>


From biopython at maubp.freeserve.co.uk  Thu Feb 17 11:05:14 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 11:05:14 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To: <E1PpoNx-0006kR-B8@mendel.bio.caltech.edu>
References: <E1PpoNx-0006kR-B8@mendel.bio.caltech.edu>
Message-ID: <AANLkTimoY8HgrXSbDxTGwoiTx260c+ZsojKJXrgiWG-4@mail.gmail.com>

On Wed, Feb 16, 2011 at 8:54 PM, David Mathog <mathog at caltech.edu> wrote:
> Test case fasta file
>>8Achars
> AAAAAAAA
>
> all 6 frames for transeq, standard mode emits:
>>_1
> KKX
>>_2
> KKX
>>_3
> KK
>>_4
> FF
>>_5
> FFX
>>_6
> FFX
>

Note you can do that with a single command line:

$ transeq asis:AAAAAAAA -filter -frame 6
>asis_1
KKX
>asis_2
KKX
>asis_3
KK
>asis_4
FF
>asis_5
FFX
>asis_6
FFX

Note that while using 1, 2, 3 for the forward frames is well defined, there
are two conventions for the reverse frame - do you start from the left or
the right?

First let's just do the forward frames,

$ transeq asis:AAAAAAAA -filter -frame 1
>asis_1
KKX
$ transeq asis:AAAAAAAA -filter -frame 2
>asis_2
KKX
$ transeq asis:AAAAAAAA -filter -frame 3
>asis_3
KK

Are you happy with them?

Now let's do that with the reverse complement strand:

$ transeq asis:TTTTTTTT -filter -frame 1
>asis_1
FFX
$ transeq asis:TTTTTTTT -filter -frame 2
>asis_2
FFX
$ transeq asis:TTTTTTTT -filter -frame 3
>asis_3
FF

Now let's do that with the original sequence but the negative frames:

$ transeq asis:AAAAAAAA -filter -frame -3
>asis_6
FFX
$ transeq asis:AAAAAAAA -filter -frame -2
>asis_5
FFX
$ transeq asis:AAAAAAAA -filter -frame -1
>asis_4
FF

Same results - perhaps the naming isn't as you expected?
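
The relabelling can be written down as a small rule. The sketch below is inferred only from the outputs posted in this thread (frame -k re-using the codon set of forward frame k); it is not taken from the transeq source, and `equivalent_revcomp_frame` is a made-up name.

```python
def equivalent_revcomp_frame(seq_length, negative_frame):
    """Map frame -k on a sequence to the forward frame (1, 2 or 3) of its
    reverse complement that gives the same translation.

    Assumption: frame -k uses the same codons as forward frame k, so the
    bases left over at the far end of frame k are skipped when reading
    the reverse complement.
    """
    k = abs(negative_frame)
    if k not in (1, 2, 3):
        raise ValueError("frame must be -1, -2 or -3")
    leftover = (seq_length - (k - 1)) % 3  # trailing partial codon of frame k
    return leftover + 1


for frame in (-1, -2, -3):
    print(frame, "->", equivalent_revcomp_frame(8, frame))
```

For the 8-base example this pairs -1 with frame 3 of TTTTTTTT, -2 with frame 2 and -3 with frame 1, which is the pairing visible in the commands above. The pairing depends on the sequence length modulo 3, so it is different for other lengths.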

Peter


From oliver.liegmann at biologie.uni-freiburg.de  Thu Feb 17 13:16:46 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Thu, 17 Feb 2011 14:16:46 +0100
Subject: [EMBOSS] seqret does not find sequence after update
In-Reply-To: <40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
References: <1296806018.12454.29.camel@yoda>
	<40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
Message-ID: <1297948606.14091.8.camel@yoda>

Hello,

thank you very much for your reply. dbifasta with "C" locales and
dbxfasta both seem to work well.

To answer your question: The operating system is Ubuntu 10.04 (Lucid)
with locales set to de_DE.utf8.

Best regards,
Oliver Liegmann

P.S.: I have CC'ed this message to the list so that other users know about
the workaround. A short note about the locale issue should probably be
added to the documentation of dbifasta.
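
To illustrate why the locale matters for a sorted index, here is a small Python sketch using the three IDs from the test file. The first function sorts byte-wise, as LC_ALL=C does. The second is only a rough stand-in for a natural-language collation (an assumption on my part: punctuation is skipped and case folded); it is not the de_DE.utf8 rules themselves and not what dbifasta does internally.

```python
IDS = ["PLAFA_MAL13P1.23-b", "PLAFA_MAL13P1.237a", "PLAFA_MAL13P1.23-a"]


def c_locale_order(names):
    """Byte-wise order, as with LC_ALL=C."""
    return sorted(names, key=lambda s: s.encode("ascii"))


def punctuation_blind_order(names):
    """Approximation of a language-aware collation (assumption):
    compare only letters and digits, ignoring case."""
    def key(s):
        return "".join(ch for ch in s if ch.isalnum()).lower()
    return sorted(names, key=key)


print(c_locale_order(IDS))
print(punctuation_blind_order(IDS))
```

The two orders disagree about where the 237a entry belongs, so any lookup that expects one order while the index was written in the other can miss entries.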

On Friday, 04.02.2011, 19:34 +0000, ajb at ebi.ac.uk wrote: 
> Hello,
> 
> I could reproduce your problem. It appears to be a manifestation of the
> GNU sort "sorting order". If you, depending on your shell, do:
> 
>   export LC_ALL=C
> 
> or
> 
>   setenv LC_ALL C
> 
> and then re-index using dbifasta then retrieval should work as expected.
> Alternatively, use the dbx indexing system, which does not rely on
> GNU sort.
> 
> Incidentally, what operating system and version are you using?
> 
> HTH
> 
> Alan Bleasby
> EBI
> 
> 
> 
> > Dear list members,
> >
> > has any of you also run into this problem (and perhaps has an idea of what is
> > going wrong):
> >
> > After upgrading from version 6.2.0 to 6.3.1 seqret does not work
> > properly anymore:
> > First, Emboss was installed using
> > ./configure --enable-64 --prefix=/opt/emboss
> > make
> > make install
> >
> > The database was set up with:
> > dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas
> >
> > Using seqret to retrieve the sequences produces an error:
> > seqret plafa_test:PLAFA_MAL13P1.23-b
> > Reads and writes (returns) sequences
> > output sequence(s) [plafa_mal13p1.fasta]:
> > Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a'
> >
> > Only the remaining two sequences are stored in the output file.
> >
> > Have the characters allowed in the accession changed? With Emboss
> > 6.2.0 we did not have any problems, but after the upgrade a large number of
> > sequences could no longer be retrieved from our internal fasta
> > database, although the output in
> > outfile.dbifasta shows all sequences as inserted into the database.
> >
> >
> > The content of the different files are:
> > PLAFA_test.fas:
> >>PLAFA_MAL13P1.23-b
> > MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
> >>PLAFA_MAL13P1.237a
> > MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
> >>PLAFA_MAL13P1.23-a
> > MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
> >
> >
> > test.txt:
> > plafa:PLAFA_MAL13P1.23-b
> > plafa:PLAFA_MAL13P1.237a
> > plafa:PLAFA_MAL13P1.23-a
> >
> >
> > emboss.default:
> > DB plafa [
> > 	format: fasta
> > 	method: emblcd
> > 	directory: /home/liegmann/genomezoo/emboss/prob/test/db
> > 	type: P
> > ]
> >
> >
> >
> > Best regards,
> > Oliver Liegmann


-- 
Dipl.-Inf. Oliver Liegmann

AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg

+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html

MOSS 2011 - the annual meeting on bryophyte research
http://plantco.de/MOSS2011/


From mathog at caltech.edu  Thu Feb 17 16:30:25 2011
From: mathog at caltech.edu (David Mathog)
Date: Thu, 17 Feb 2011 08:30:25 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID: <E1Pq6kL-00076N-SN@mendel.bio.caltech.edu>


> Now let's do that with the reverse complement strand:
> 
> $ transeq asis:TTTTTTTT -filter -frame 1
> >asis_1
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 2
> >asis_2
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 3
> >asis_3
> FF

That is the problem.  Let me try to explain more clearly what the issue is.


AAAAAAAA               Forward
TTTTTTTT                          Reverse
                         abc      cba  <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX    4 XFF    EXPECTED
x^a^^b^^c^   phase 2   2 KKX    5 XFF    EXPECTED
xx^a^^b^     phase 3   3 KK     6  FF    EXPECTED


^a^^b^^c^    phase 1   1 KKX    4  FF    OBSERVED
x^a^^b^^c^   phase 2   2 KKX    5 FFX    OBSERVED
xx^a^^b^     phase 3   3 KK     6 FFX    OBSERVED

Assume an extra codon L to the left of a.
                         abc      baL  <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX    4 FF     EXPLAINED?
^^a^^b^^c^   phase 2   2 KKX    5 FFX    EXPLAINED?
L^^a^^b^     phase 3   3 KK     6 FFX    EXPLAINED?


That is, if the meaning of the + phases is to define the three codons
a,b,c as shown in the diagram, such that the forward translation is as
shown, then the reverse translation should be as shown above under
EXPECTED.  That is, it is the translation of the exact same set of
codons done individually: for the - strand, reverse complement each
codon first, then invert the resulting translated sequence.  That
way the X, where it occurs, stays attached to the same partial codon "c".
What I think is happening in transeq is that it starts with the
first full codon in the frame on the given strand.  In effect, that
shifts the translated codons as shown in the "EXPLAINED?" section.

If partial codons were not translated then these would all be equivalent.
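
The two conventions can be put side by side in code. This is a minimal Python sketch under stated assumptions: the standard genetic code, upper-case A/C/G/T input, and every partial or unrecognised codon written as X. `transeq_style_reverse` only reproduces the OBSERVED column above and is not derived from the transeq source; `mirrored_reverse` implements the EXPECTED column. All function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the usual TCAG order.
TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate from offset; a trailing partial codon becomes X."""
    out = []
    for i in range(offset, len(seq), 3):
        codon = seq[i:i + 3]
        out.append(TABLE.get(codon, "X") if len(codon) == 3 else "X")
    return "".join(out)


def transeq_style_reverse(seq, k):
    """Frame -k as observed: start at the first full codon of frame k's
    codon set on the reverse strand, so leftover bases end up as a final X."""
    skip = (len(seq) - (k - 1)) % 3
    return translate(revcomp(seq), skip)


def mirrored_reverse(seq, k):
    """Frame -k as expected: reverse complement each forward codon of
    frame k, then reverse the residue order, so X keeps its position."""
    body = seq[k - 1:]
    codons = [body[i:i + 3] for i in range(0, len(body), 3)]
    residues = [TABLE.get(revcomp(c), "X") if len(c) == 3 else "X" for c in codons]
    return "".join(reversed(residues))


for k in (1, 2, 3):
    print(k, translate("AAAAAAAA", k - 1),
          transeq_style_reverse("AAAAAAAA", k),
          mirrored_reverse("AAAAAAAA", k))
```

Both functions translate the same full codons; they differ only in which leftover bases are turned into X and at which end of the protein the X appears.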

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From biopython at maubp.freeserve.co.uk  Thu Feb 17 17:03:15 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 17:03:15 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To: <E1Pq6kL-00076N-SN@mendel.bio.caltech.edu>
References: <E1Pq6kL-00076N-SN@mendel.bio.caltech.edu>
Message-ID: <AANLkTi=t-Lc-fCuUp5n-XP-MZ8rcXeudQUxtkXDbuzSb@mail.gmail.com>

On Thu, Feb 17, 2011 at 4:30 PM, David Mathog <mathog at caltech.edu> wrote:
>
>
>> Now let's do that with the reverse complement strand:
>>
>> $ transeq asis:TTTTTTTT -filter -frame 1
>> >asis_1
>> FFX

This is what I think that does (forward frames are easy):

Frame 1, so starts at first base:
Letters 123, codon TTT, gives F
Letters 456, codon TTT, gives F
Letters 78, partial codon TT-, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 2
>> >asis_2
>> FFX

Frame 2, so starts at second base:
Letter 1, just T, ignored
Letters 234, codon TTT, gives F
Letters 567, codon TTT, gives F
Letters 8, partial codon T--, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 3
>> >asis_3
>> FF

Frame 3, so starts at third base:
Letters 12, bases TT, ignored
Letters 345, codon TTT, gives F
Letters 678, codon TTT, gives F


> That is the problem.  Let me try to explain more clearly what the issue is.
>
> That is, if the meaning of the + phases is to define the three codons
> a,b,c as shown in the diagram, such that the forward translation is as
> shown, then the reverse translation should be as shown above in
> expected.  That is, it is the translation of the exact same set of
> codons done individually, but for the - strand reverse complement the
> codon first, and then invert the resulting translated sequence.  That
> way the X, where it occurs is attached to the same partial codon "c".

I couldn't understand your diagram - probably font spacing issues in part.

The EMBOSS tool is doing all six frames; maybe all you need to work out
is the mapping between its naming and yours.

Note that it can make sense to translate a trailing partial codon, e.g.
TC... could be TCA, TCC, TCG or TCT which all code for S:

$ transeq asis:TCN -filter
>asis_1
S
$ transeq asis:TC -filter
>asis_1
S
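
One way to see why TC can still be translated is to try every completion of the codon. The following is a minimal Python sketch of that idea, assuming the standard genetic code and handling only A, C, G, T and N; whether transeq works this way internally is an assumption, since the thread only shows that TC and TCN both give S. `translate_codon` is a hypothetical name.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the usual TCAG order.
TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}


def translate_codon(codon):
    """Translate a full, partial or N-containing codon.

    Missing positions are treated as N. If every possible completion
    codes for the same residue, return it; otherwise return X.
    """
    padded = codon.upper().ljust(3, "N")[:3]
    if any(base not in "ACGTN" for base in padded):
        return "X"  # other ambiguity codes are not handled in this sketch
    choices = [BASES if base == "N" else base for base in padded]
    residues = {TABLE[a + b + c]
                for a in choices[0] for b in choices[1] for c in choices[2]}
    return residues.pop() if len(residues) == 1 else "X"


for codon in ("TCN", "TC", "TT", "AA"):
    print(codon, translate_codon(codon))
```

TC resolves to S because all four TCx codons code for serine, whereas TT or AA could each complete to two different residues and therefore stay X.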

Peter


From marvin.stodolsky at gmail.com  Fri Feb 18 02:23:15 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Thu, 17 Feb 2011 21:23:15 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
References: <A68EAD09E5FE3948946190959C9BBAB8332F30FC2D@EXCH1P.sc.science.doe.gov>
	<AANLkTim6UtuxWtSQ1qHiM5V+Owot1OOj1LAY+G6b4A0z@mail.gmail.com>
	<4D5A4068.4000302@ebi.ac.uk>
	<AANLkTimVesuYCH4RnUaOZZvFwpKek8Gy_y6hWxb8Gt7w@mail.gmail.com>
	<825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
Message-ID: <AANLkTimaSiEnPX4O6rGWBqLaRoi=kA7kiGGHesEk4h50@mail.gmail.com>

Sorry,

Here is the attachment.
The whole cleanup process could be done with only sed calls, I'm sure,
but that would be beyond my sed comfort level.

MarvS

On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller <kellert at ohsu.edu> wrote:
> Hi Martin,
> I am interested in the solution. There was no attachment to the email I received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
>
>
>
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> All thanks for the suggestions.  A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository.  This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>>
>> Specifically, is the command using ENTRET or relatives , to accept a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job.
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>
>>>>  This is elementary I'm sure, but I've been unable to work out the
>>>> syntax  from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>>   -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>>   -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>
>
-------------- next part --------------


With respect to using the information in the FASTA description field, the intent and a partial solution can now be explained.
The top-level intent is to avoid overlapping genes in a statistical analysis being planned.
The 3rd and 4th lines below, from an "infoseq -nocolumns" whole-genome retrieval, report an overlap: 
the DNA gyrase subunit A gene is overlapped by seryl-tRNA synthetase, since seryl_begin - gyrase_end = 7294 - 7322 < 0.

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]

In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the protein/codon-usage statistical analysis being planned. Thus I wanted an en masse way of recognizing the overlapping genes.
A non-elegant fix has been worked out.

After pulling the dataset into a spreadsheet, spaces in the description field were replaced with >< :
DnaJ><domain><protein><1828..2760(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)><[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)><[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)><[Mycoplasma><genitalium><G37]
thymidylate><kinase><8551..9183(+)><[Mycoplasma><genitalium><G37]

Next, ><[ is replaced by |[ , where | is the future field separator:
DNA><polymerase><III,><beta><subunit><686..1828(+)|[Mycoplasma><genitalium><G37]
DnaJ><domain><protein><1828..2760(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><B><2845..4797(+)|[Mycoplasma><genitalium><G37]
DNA><gyrase><subunit><A><4812..7322(+)|[Mycoplasma><genitalium><G37]
seryl-tRNA><synthetase><7294..8547(+)|[Mycoplasma><genitalium><G37]
and the file saved as:   Myc637000176m.csv 

To get rid of the >< in the common trailing [Mycoplasma><genitalium><G37], the following was run:
$  cut -d"[" -f1 Myc637000176m.csv > Myc637000176m2.csv
resulting in:
DNA><polymerase><III,><beta><subunit><686..1828(+)|
DnaJ><domain><protein><1828..2760(+)|
DNA><gyrase><subunit><B><2845..4797(+)|
DNA><gyrase><subunit><A><4812..7322(+)|
seryl-tRNA><synthetase><7294..8547(+)|

The internal fields are then mostly deleted with:
sed -e 's/<.*>//g'  Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

The single remaining >< is replaced with the potential separator | 
sed -e 's/></|/g'  Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|
BASICALLY, the clever work is now done, and the rest is more routine manipulation.

A cleanup was done with:
sed -e 's/)|//g'  Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g'  Myc637000176m5.csv > Myc637000176m6.csv
which together change  (+)|  to  |+ , that is, a separate field.

Replacing the residual  ..  with the potential separator | was most easily done as an operation within the spreadsheet, on that field alone, because there are too many other "." characters in the whole file.
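
For comparison, the same fields can be pulled out in one step with a regular expression instead of the chain of replacements. This is a different technique from the one described above, shown only as a minimal Python sketch: it assumes each description looks like the sample lines (name, begin..end, strand in parentheses, optional organism in square brackets), and `parse_description` is a hypothetical name. Unlike the sed steps, it keeps the whole gene name.

```python
import re

# name, begin..end, (strand), optional [organism]; layout assumed from the samples.
FIELD_RE = re.compile(
    r"^(?P<name>.*?)\s+(?P<start>\d+)\.\.(?P<end>\d+)\((?P<strand>[+-])\)"
    r"(?:\s*\[(?P<organism>.*)\])?\s*$"
)


def parse_description(line):
    """Return (name, start, end, strand, organism), or None if the line
    does not follow the assumed layout."""
    m = FIELD_RE.match(line)
    if m is None:
        return None
    return (m.group("name"), int(m.group("start")), int(m.group("end")),
            m.group("strand"), m.group("organism"))


print(parse_description(
    "DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]"))
```

The resulting tuples can be written out with any separator, which avoids the problem of stray "." characters elsewhere in the file.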

After routine manipulations within the spreadsheet, 
a view of the overlap detection section is:
 F        G        H                I                    J   (fields)
Start    End      Begin-nextEnd    OR((H2<0),(H1<0))    Stable 0/1 value, for sorting on
686      1828       0              FALSE                0
1828     2760      85              FALSE                0
2845     4797      15              TRUE                 0
4812     7322     -28              TRUE                 1
7294     8547       4              TRUE                 1
8551     9183     -27              TRUE                 1
9156     9920       3              FALSE                0
9923     11251      0              FALSE                0

The overlapping genes have the stable value 1 in field J, which survives sorting, whereas the FALSE/TRUE formula in field I is not stable during sorting.
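
The overlap test itself can also be computed directly from the Start and End columns. The following minimal Python sketch applies the same criterion as fields H and I (next Start minus this End below zero, for a gene or for its predecessor) to genes already sorted by Start; `flag_overlaps` is a hypothetical name. Computed this way, the gene starting at 9156 is flagged as well, because it begins before the 8551..9183 gene ends.

```python
def flag_overlaps(genes):
    """genes: list of (start, end) tuples sorted by start.

    A gene is flagged when the gap to its next neighbour, or the gap
    from its previous neighbour, is negative. Only adjacent genes are
    compared, so a long gene spanning several others is not fully covered.
    """
    gaps = [genes[i + 1][0] - genes[i][1] for i in range(len(genes) - 1)]
    flags = []
    for i in range(len(genes)):
        overlaps_next = i < len(gaps) and gaps[i] < 0
        overlaps_prev = i > 0 and gaps[i - 1] < 0
        flags.append(overlaps_next or overlaps_prev)
    return flags


GENES = [(686, 1828), (1828, 2760), (2845, 4797), (4812, 7322),
         (7294, 8547), (8551, 9183), (9156, 9920), (9923, 11251)]

for (start, end), flag in zip(GENES, flag_overlaps(GENES)):
    print(start, end, int(flag))
```

Because the flag is a plain value rather than a formula that refers to neighbouring rows, it stays attached to its gene however the rows are sorted afterwards. A gap of exactly 0, as between 686..1828 and 1828..2760, is not counted as an overlap under this criterion.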
		

From egorleg at gmail.com  Wed Feb 23 00:56:40 2011
From: egorleg at gmail.com (Kevin Egan)
Date: Wed, 23 Feb 2011 00:56:40 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To: <AANLkTi=Ua1uJk4Pb=oZOc-s+UUArAQVkBEKLABOcGSHJ@mail.gmail.com>
References: <mailman.9199.1298420178.2958.emboss@lists.open-bio.org>
	<AANLkTi=Ua1uJk4Pb=oZOc-s+UUArAQVkBEKLABOcGSHJ@mail.gmail.com>
Message-ID: <AANLkTi=Uggy3csMtVLqAMNZx0zk1WgD=42vyy36rPML7@mail.gmail.com>

Hi

I was wondering whether there is anywhere I could find the source code for
dot-matcher?


From uludag at ebi.ac.uk  Wed Feb 23 09:50:28 2011
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Wed, 23 Feb 2011 09:50:28 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To: <AANLkTi=Uggy3csMtVLqAMNZx0zk1WgD=42vyy36rPML7@mail.gmail.com>
References: <mailman.9199.1298420178.2958.emboss@lists.open-bio.org>
	<AANLkTi=Ua1uJk4Pb=oZOc-s+UUArAQVkBEKLABOcGSHJ@mail.gmail.com>
	<AANLkTi=Uggy3csMtVLqAMNZx0zk1WgD=42vyy36rPML7@mail.gmail.com>
Message-ID: <1298454628.8626.3.camel@emboss1.ebi.ac.uk>

Hi Kevin,

> I was wondering is there anywhere I could find the source code for
> dot-matcher?

EMBOSS release tarballs include source files for EMBOSS applications and
for EMBOSS libraries.

    ftp://emboss.open-bio.org/pub/EMBOSS/

Regards,
Mahmut