From wo.granon at gmail.com Wed Feb 2 14:02:46 2011
From: wo.granon at gmail.com (Wolfgang Gruber)
Date: Wed, 2 Feb 2011 20:02:46 +0100
Subject: [EMBOSS] Mistake in Appdoc Edialign?
Message-ID:

Hello,

I studied the papers for DIALIGN, and only for the newest version, DIALIGN-TX (Subramanian et al., 2008), can I find the information that DIALIGN uses a guide tree. In the appdoc for edialign I read that EMBOSS uses DIALIGN2. In that publication (Morgenstern, 1999) I cannot find any information that a guide tree is used. In the original DIALIGN2 documentation I also read: "This tree is constructed by applying the UPGMA clustering method to the DIALIGN similarity scores." but nothing saying that this tree is used for guiding. So is this information in the EMBOSS appdoc incorrect?

More generally: is there a plan to update to DIALIGN-TX?

Thanks,
Wolfgang

Morgenstern, B.: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. In: Bioinformatics (Oxford, England) vol. 15 (1999), no. 3, pp. 211-218. PMID: 10222408

Subramanian, A. ; Kaufmann, M. ; Morgenstern, B.: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment.
In: Algorithms for Molecular Biology vol. 3 (2008), no. 1, p. 6

From oliver.liegmann at biologie.uni-freiburg.de Fri Feb 4 02:53:38 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Fri, 04 Feb 2011 08:53:38 +0100
Subject: [EMBOSS] seqret does not find sequence after update
Message-ID: <1296806018.12454.29.camel@yoda>

Dear list members,

has any of you also run into this problem (and perhaps has an idea of what is going wrong)?

After upgrading from version 6.2.0 to 6.3.1, seqret does not work properly anymore.

First, EMBOSS was installed using:
./configure --enable-64 --prefix=/opt/emboss
make
make install

The database was set up with:
dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas

Using seqret to retrieve the sequences produces an error:
seqret plafa_test:PLAFA_MAL13P1.23-b
Reads and writes (returns) sequences
output sequence(s) [plafa_mal13p1.fasta]:
Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a'

Only the remaining two sequences are stored in the output file.

Have the characters allowed in an accession changed? With EMBOSS 6.2.0 we did not have any problems, but after the upgrade a large number of sequences could no longer be retrieved from our internal FASTA database, although the output in outfile.dbifasta shows that all sequences were inserted into the database.

The contents of the different files are:

PLAFA_test.fas:
>PLAFA_MAL13P1.23-b
MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
>PLAFA_MAL13P1.237a
MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
>PLAFA_MAL13P1.23-a
MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH

test.txt:
plafa:PLAFA_MAL13P1.23-b
plafa:PLAFA_MAL13P1.237a
plafa:PLAFA_MAL13P1.23-a

emboss.default:
DB plafa [
format: fasta
method: emblcd
directory: /home/liegmann/genomezoo/emboss/prob/test/db
type: P
]

Best regards,
Oliver Liegmann

--
Dipl.-Inf.
Oliver Liegmann
AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg
+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: outfile.dbifasta
Type: application/octet-stream
Size: 848 bytes
Desc: not available
URL:

From Caroline.Barretto at rdls.nestle.com Tue Feb 8 05:01:28 2011
From: Caroline.Barretto at rdls.nestle.com (Barretto, Caroline, LAUSANNE, BioInformatics)
Date: Tue, 8 Feb 2011 11:01:28 +0100
Subject: [EMBOSS] diffseq memory problem?
Message-ID:

Dear EMBOSS developers,

I have been using diffseq to compare two strains of the same bacterial species using "10" as wordsize without any problem.

However, when I try to reduce this number to "4", after several hours of calculation the server collapses: all RAM and swap are used.

Is there any option to avoid that, or do you know if someone is working on that problem?

Many thanks,
Best regards,
Caroline.

From pmr at ebi.ac.uk Tue Feb 8 05:46:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 08 Feb 2011 10:46:32 +0000
Subject: [EMBOSS] diffseq memory problem?
In-Reply-To:
References:
Message-ID: <4D511F08.7010206@ebi.ac.uk>

Dear Caroline,

On 08/02/2011 10:01, Barretto, Caroline, LAUSANNE, BioInformatics wrote:
> Dear EMBOSS developers,
>
> I have been using diffseq to compare two strains of the same bacterial
> species using "10" as wordsize without any problem.
>
> However, when I try to reduce this number to "4", after several hours of
> calculation the server collapses: all RAM and swap are used.
>
> Is there any option to avoid that, or do you know if someone is working
> on that problem?

Depending on the input size, and the number of simple repeats, a low word size could easily generate too many matches for large sequence lengths. We would recommend reducing the word size more slowly (maybe 10, 8, 6).
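
A rough way to see why a small word size can exhaust memory is to estimate how many exact word matches two sequences would share by chance alone. The sketch below assumes uniformly random bases (real genomes with simple repeats will do worse) and uses a made-up sequence length; it is not how diffseq itself counts or stores matches.

```python
def expected_chance_matches(len_a, len_b, wordsize, alphabet=4):
    """Expected number of exact word matches between two random sequences.

    Assumes every base is drawn uniformly and independently, which is a
    simplification; it only illustrates how fast the count grows.
    """
    positions_a = max(len_a - wordsize + 1, 0)
    positions_b = max(len_b - wordsize + 1, 0)
    return positions_a * positions_b / alphabet ** wordsize


def non_overlapping_words(length, wordsize):
    """Number of non-overlapping words of the given size in one sequence."""
    return length // wordsize


if __name__ == "__main__":
    genome_length = 2_000_000  # hypothetical length, for illustration only
    for wordsize in (10, 8, 6, 4):
        matches = expected_chance_matches(genome_length, genome_length, wordsize)
        words = non_overlapping_words(genome_length, wordsize)
        print(f"wordsize={wordsize:2d} chance matches ~{matches:.3g} "
              f"non-overlapping words={words}")
```

Each step down in word size multiplies the chance-match estimate by roughly 4, so going from 10 straight to 4 increases it by a factor of about 4096.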
As a guideline, finding more matches than there are non-overlapping words in the sequence is unlikely to be useful and is a reasonable point to stop reducing the word size.

Meanwhile, we will take a look at diffseq in case there is some way to improve its performance or to warn at an early stage if the word size appears small for the input sequence lengths and may generate too many matches.

Hope this helps

Peter Rice
EMBOSS Team

From WulfDirk.Leuschner at sanofi-aventis.com Thu Feb 10 02:43:17 2011
From: WulfDirk.Leuschner at sanofi-aventis.com (WulfDirk.Leuschner at sanofi-aventis.com)
Date: Thu, 10 Feb 2011 08:43:17 +0100
Subject: [EMBOSS] lit. references for EMBOSS data files, e.g. Epk.dat (iep usage)
Message-ID: <650F10565E484347B51CF6679663A80402F51F19@ffpw10.f2.enterprise>

Hi all,

I was wondering whether someone might know something about how some of the metadata used in EMBOSS were compiled. A colleague of mine was looking for a reference for the Epk.dat values used for the determination of the isoelectric point of a protein. However, neither she nor I could find anything...

Any hints?

Wulf Dirk Leuschner

From jison at ebi.ac.uk Thu Feb 10 11:42:34 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 10 Feb 2011 16:42:34 -0000 (UTC)
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D41F992.5030900@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu>
Message-ID: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>

Hi Lionel

Didn't see a reply to you, sorry.

Anyhow, dreg will search the sequence as given. This is taken as the sense/coding/+ strand.

If you specify -sreverse (which is available to any application that reads sequences) it will, I think, search the reverse complement of that sequence instead.

Cheers

Jon

> Hello fellow EMBOSS fans,
>
> I am using the dreg program to search the human genome for my favorite
> motif. I was unable to find any information regarding the meaning of
> the strand information in the output.
> Does dreg search both strands or
> will it always return "+" as the strand designation of the hits that it
> finds?
>
> Thanks for your continued support and development of this fantastic tool!
>
> Sincerely,
> Lionel "Lee" Brooks 3rd
> Dartmouth Genetics Grad Student

From sigve.nakken at medisin.uio.no Fri Feb 11 05:54:44 2011
From: sigve.nakken at medisin.uio.no (Sigve Nakken)
Date: Fri, 11 Feb 2011 11:54:44 +0100
Subject: [EMBOSS] DNA sequence as input argument
Message-ID: <4D551574.2080004@medisin.uio.no>

Hi,

Is there any way in which one can read a DNA sequence directly from the command line (that is, as a string input argument) rather than from a file? I am especially interested in finding repeats, inverted repeats etc. (e.g. the 'einverted' and 'etandem' EMBOSS apps). Instead of creating a FASTA file for each query sequence, I would like to read the sequence directly from the command line. Is this possible?

Kind regards,
Sigve

From pmr at ebi.ac.uk Fri Feb 11 06:11:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 11 Feb 2011 11:11:13 +0000
Subject: [EMBOSS] DNA sequence as input argument
In-Reply-To: <4D551574.2080004@medisin.uio.no>
References: <4D551574.2080004@medisin.uio.no>
Message-ID: <4D551951.7030406@ebi.ac.uk>

Dear Sigve,

On 11/02/2011 10:54, Sigve Nakken wrote:
> Hi,
>
> Is there any way in which one can read a DNA sequence directly from the
> command line (that is, as a string input argument) rather than from a
> file? I am especially interested in finding repeats, inverted repeats
> etc. (e.g. the 'einverted' and 'etandem' EMBOSS apps). Instead of
> creating a FASTA file for each query sequence, I would like to read the
> sequence directly from the command line. Is this possible?

seqret asis::ctgatcgatgctagctgac

The "asis" format was included exactly for this purpose.
You do need to take care that a long sequence is not too long for your shell to handle on the command line (a shell issue, not an EMBOSS issue).

You can also add to the command line:

-sid abc123

This will give it an ID of abc123, and the output file will default to (for seqret) abc123.fasta and will have the abc123 identifier in it.

Hope this helps

Peter Rice
EMBOSS Team

From stephen.taylor at imm.ox.ac.uk Fri Feb 11 06:13:45 2011
From: stephen.taylor at imm.ox.ac.uk (Steve Taylor)
Date: Fri, 11 Feb 2011 11:13:45 +0000
Subject: [EMBOSS] DNA sequence as input argument
In-Reply-To: <4D551574.2080004@medisin.uio.no>
References: <4D551574.2080004@medisin.uio.no>
Message-ID: <4D5519E9.1050401@imm.ox.ac.uk>

Hi,

> Is there any way in which one can read a DNA sequence directly from the
> command line (that is, as a string input argument) rather than from a
> file? I am especially interested in finding repeats, inverted repeats
> etc. (e.g. the 'einverted' and 'etandem' EMBOSS apps). Instead of
> creating a FASTA file for each query sequence, I would like to read the
> sequence directly from the command line. Is this possible?

From the FAQ at http://emboss.sourceforge.net/docs/faq.html:

A) The "filename" is really the sequence. This is a quick and easy way of reading in a short fragment of sequence without having to enter it into a file. For example:

% program -seq asis::ATGGTGAGGAGAGTTGTGATGAGA

Steve

From Lionel.Brooks at dartmouth.edu Fri Feb 11 13:22:38 2011
From: Lionel.Brooks at dartmouth.edu (Lionel (Lee) Brooks 3rd)
Date: Fri, 11 Feb 2011 13:22:38 -0500
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>
Message-ID: <4D557E6E.3020608@dartmouth.edu>

Hi Jon,

Thank you! Apparently, I should have rtfm more than once.

Sincerely,
Lionel

From pmr at ebi.ac.uk Fri Feb 11 13:46:22 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 11 Feb 2011 18:46:22 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D557E6E.3020608@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> <4D557E6E.3020608@dartmouth.edu>
Message-ID: <4D5583FE.1060703@ebi.ac.uk>

Dear Lee,

On 11/02/2011 18:22, Lionel (Lee) Brooks 3rd wrote:
> Hi Jon,
>
> Thank you! Apparently, I should have rtfm more than once.

True ... but it is not obvious which part to (re-)read. We could make it easier. Perhaps the "Input sequence" section could point to the sequence qualifiers and the USA syntax.

We will look at improving this part of the documentation in the next release.

regards,

Peter Rice
EMBOSS Team

From db60 at st-andrews.ac.uk Sat Feb 12 07:07:03 2011
From: db60 at st-andrews.ac.uk (Daniel Barker)
Date: Sat, 12 Feb 2011 12:07:03 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D5583FE.1060703@ebi.ac.uk>
References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> <4D557E6E.3020608@dartmouth.edu> <4D5583FE.1060703@ebi.ac.uk>
Message-ID: <4D5677E7.1080402@st-andrews.ac.uk>

Dear Peter,

A lot of the time for nucleotide stuff it makes sense to search both strands. Of course, it isn't hard to search one strand, then the other. But this introduces an extra step. I wonder if there could be some convenient option to do this, and if it should perhaps be the default? (As with NCBI blastall with any kind of nucleotide search.)

This would affect programs beyond just dreg and, though it would be OK for our work, perhaps it wouldn't make sense for others. Just a thought.

Best regards,

Daniel

--
Daniel Barker
http://bio.st-andrews.ac.uk/staff/db60.htm
The University of St Andrews is a charity registered in Scotland : No SC013532

From pmr at ebi.ac.uk Sat Feb 12 11:58:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 12 Feb 2011 16:58:32 +0000
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D5677E7.1080402@st-andrews.ac.uk>
References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> <4D557E6E.3020608@dartmouth.edu> <4D5583FE.1060703@ebi.ac.uk> <4D5677E7.1080402@st-andrews.ac.uk>
Message-ID: <4D56BC38.9070408@ebi.ac.uk>

Dear Daniel,

On 12/02/2011 12:07, Daniel Barker wrote:
> A lot of the time for nucleotide stuff it makes sense to search both
> strands. Of course, it isn't hard to search one strand, then the other.
> But this introduces an extra step. I wonder if there could be some
> convenient option to do this, and if it should perhaps be the default?
> (As with NCBI blastall with any kind of nucleotide search.)
>
> This would affect programs beyond just dreg and, though it would be OK
> for our work, perhaps it wouldn't make sense for others. Just a thought.

Interesting suggestion.
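
The two-pass idea Daniel describes (search one strand, then the other) can be sketched outside EMBOSS as follows. This is only an illustration using Python's regular expressions, not how dreg works internally; the function name and the choice to report reverse-strand hits in forward-strand coordinates are assumptions made for the example.

```python
import re

# Complement table for unambiguous bases; other symbols (e.g. N) pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def search_both_strands(seq, pattern):
    """Find regex hits on both strands.

    Returns (start, end, strand, matched_text) tuples with 1-based,
    inclusive coordinates given on the forward strand. Like any
    re.finditer search, overlapping hits on one strand are not reported.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    length = len(seq)
    hits = []
    # First pass: the sequence as given ("+" strand).
    for m in regex.finditer(seq):
        hits.append((m.start() + 1, m.end(), "+", m.group()))
    # Second pass: the reverse complement ("-" strand), mapped back.
    for m in regex.finditer(reverse_complement(seq)):
        hits.append((length - m.end() + 1, length - m.start(), "-", m.group()))
    return sorted(hits)


if __name__ == "__main__":
    for hit in search_both_strands("CCTATAAAGGTTTATAGG", "TATAAA"):
        print(hit)
```

A palindromic motif would be reported once per strand at the same coordinates, which is one of the "do the results make sense?" questions any built-in option would have to settle.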
Maybe we can add a -bothstrands option for applications to search the forward and reverse strands.

We need to consider:
* Do the results make sense?
* What default do we set (maybe some programs have a different default)?
* Is this complicated for programs that can use DNA or protein input?
* Can we apply it to applications aligning two sequences?

Meanwhile, running twice, with -sreverse the second time, will find you all the matches.

regards,

Peter Rice
EMBOSS Team

From david.bauer at bayer.com Mon Feb 14 02:36:46 2011
From: david.bauer at bayer.com (david.bauer at bayer.com)
Date: Mon, 14 Feb 2011 08:36:46 +0100
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D56BC38.9070408@ebi.ac.uk>
Message-ID:

Hi Daniel & Peter,

emboss-bounces at lists.open-bio.org wrote on 12/02/2011 17:58:32:

> Dear Daniel,
>
> On 12/02/2011 12:07, Daniel Barker wrote:
> > A lot of the time for nucleotide stuff it makes sense to search both
> > strands. Of course, it isn't hard to search one strand, then the other.
> > But this introduces an extra step. I wonder if there could be some
> > convenient option to do this, and if it should perhaps be the default?
> > (As with NCBI blastall with any kind of nucleotide search.)
> >
> > This would affect programs beyond just dreg and, though it would be OK
> > for our work, perhaps it wouldn't make sense for others. Just a thought.
>

I think another candidate would be fuzznuc. (That, at least, is the program where I have sometimes missed this option ;-)

> Interesting suggestion.
>
> Maybe we can add a -bothstrands option for applications to search the
> forward and reverse strands.

Yes, this would add the new functionality without breaking the old default behaviour of the programs.

> We need to consider:
> * Do the results make sense?
> * What default do we set (maybe some programs have a different default)?

As mentioned above, I would not touch the old default settings and would add searching both strands as an option. (e.g.
in stssearch the search of both strands is already the default.)

> * Is this complicated for programs that can use DNA or protein input?
> * Can we apply it to applications aligning two sequences?

I think it could make sense for programs which can align one sequence against a set of other sequences (e.g. water, needle).

Regards,
David.

From marvin.stodolsky at gmail.com Mon Feb 14 18:35:45 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Mon, 14 Feb 2011 18:35:45 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To:
References:
Message-ID:

This is elementary, I'm sure, but I've been unable to work out the syntax from the documentation.

A more minor issue:

When using infoseq to extract all the FASTA headers from a sequence repository, the GeneBegin..GeneEnd pair (like 234466..234589) often fails to come out as a uniform field (or fields) in the resulting spreadsheet. Is there a fix for this?

MarvS

From pmr at ebi.ac.uk Tue Feb 15 03:59:20 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 15 Feb 2011 08:59:20 +0000
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To:
References:
Message-ID: <4D5A4068.4000302@ebi.ac.uk>

On 14/02/2011 23:35, Marvin Stodolsky wrote:
> This is elementary, I'm sure, but I've been unable to work out the
> syntax from the documentation.
> A more minor issue:
>
> When using infoseq to extract all the FASTA headers from a sequence
> repository, the GeneBegin..GeneEnd pair (like 234466..234589) often
> fails to come out as a uniform field (or fields) in the resulting
> spreadsheet. Is there a fix for this?

I don't see the genebegin and geneend in EMBOSS infoseq output. Are they part of the sequence ID in the FASTA file?

You can use a delimiter between items for infoseq using:

-nocolumn

on the command line.

For import into a spreadsheet you can set the delimiter to be tab with:

-nocolumn -delimiter "\t"

on the command line. That should then import nicely into a spreadsheet.
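
Once the output is tab-delimited, a coordinate pair such as 234466..234589 can be pulled out into two numeric fields before the spreadsheet import. The sketch below is only an illustration: it assumes the pair sits somewhere inside one tab-separated column (which column is up to the caller), and the function name is made up for the example; it does not rely on any particular infoseq column layout.

```python
import csv
import io
import re

# Matches a coordinate pair written as begin..end, e.g. 234466..234589
COORDS = re.compile(r"(\d+)\.\.(\d+)")


def split_coordinates(tsv_text, column):
    """Yield each tab-delimited row with begin and end appended.

    `column` is the 0-based index of the field expected to contain the
    begin..end pair. Rows without a pair get None, None so that every
    output row has the same number of fields.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        match = COORDS.search(row[column]) if column < len(row) else None
        if match:
            begin, end = int(match.group(1)), int(match.group(2))
        else:
            begin, end = None, None
        yield row + [begin, end]


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    sample = "id1\tsome gene 234466..234589\nid2\tno coordinates here\n"
    for row in split_coordinates(sample, column=1):
        print(row)
```

Appending the two values as extra fields, rather than rewriting the description, keeps the original text intact for checking.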
Hope that helps

Peter Rice
EMBOSS Team

From mathog at caltech.edu Wed Feb 16 15:54:05 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 12:54:05 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID:

Test case fasta file:

>8Achars
AAAAAAAA

For all 6 frames, transeq in standard mode emits:

>_1
KKX
>_2
KKX
>_3
KK
>_4
FF
>_5
FFX
>_6
FFX

But...

AAAAAAAA Forward
TTTTTTTT Reverse
abc cba   <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX   4 XFF
x^a^^b^^c^   phase 2   2 KKX   5 XFF
xx^a^^b^     phase 3   3 KK    6 FF

That is, frames 4->6 are supposed to use, respectively, the same set of codons as 1->3, but translated on the opposite strand. Shouldn't the number of residues returned be as in the table above, and with the X at the beginning rather than the end? Or to put it another way, shouldn't the little "x" bases be ignored on the - strand if they were also ignored on the +?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From mathog at caltech.edu Wed Feb 16 17:47:15 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 14:47:15 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID:

Here is another worked example with a small but real mRNA fragment. (Best cut and paste it into a program with a fixed width font).
Test sequence:

>for (AKA gi|1728|emb|V00893.1, this is "+" direction)
TCGAAAACCGGGCCATGAAGGATGAGGAGAAGATGGAGCTGCA
GGAGATGCAGCTGAAGGAGGCCAAGCACATTGCCGAGGACTCA
GACCGCAAATACGAGGAGGTGGCCAGGAAGCTGGTGATCCTCGA
>rev (for reversed)
TCGAGGATCACCAGCTTCCTGGCCACCTCCTCGTATTTGCGGT
CTGAGTCCTCGGCAATGTGCTTGGCCTCCTTCAGCTGCATCTC
CTGCAGCTCCATCTTCTCCTCATCCTTCATGGCCCGGTTTTCGA

Transeq output, all 6 frames, for >for and >rev:

>for_1
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>for_2
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>for_3
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>for_4
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>for_5
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>for_6
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_1
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>rev_2
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>rev_3
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_4
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>rev_5
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>rev_6
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX

Output from a different program, with all 12 frame options shown on the fasta header line as:

phase(strand)

Positive phases are measured from sequence position 1. Negative phases are measured from sequence position N, the last base in the sequence. This program differs from transeq in that any partial codon is emitted as an X. Note how transeq output never starts with an X, whereas here the X maintains its position on the nucleic acid sequence; compare, for instance, +1(+) and +1(-).
>gi|1728|emb|V00893.1|[+1(+)]
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>gi|1728|emb|V00893.1|[+2(+)]
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[+3(+)]
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>gi|1728|emb|V00893.1|[+1(-)]
XRGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[+2(-)]
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFS
>gi|1728|emb|V00893.1|[+3(-)]
XEDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVF
>gi|1728|emb|V00893.1|[-1(-)]
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>gi|1728|emb|V00893.1|[-2(-)]
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[-3(-)]
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>gi|1728|emb|V00893.1|[-1(+)]
XRKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[-2(+)]
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SS
>gi|1728|emb|V00893.1|[-3(+)]
XENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVIL
>gi|1728|emb|V00893.1|

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From marvin.stodolsky at gmail.com Wed Feb 16 21:07:51 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Wed, 16 Feb 2011 21:07:51 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <4D5A4068.4000302@ebi.ac.uk>
References: <4D5A4068.4000302@ebi.ac.uk>
Message-ID:

Thanks to all for the suggestions. A solution to the GeneBegin..GeneEnd problem has been worked out, per the attachment, for those interested.

But for me the more important problem is making a FASTA repository which is a subset of the gene files in a much larger repository. This is desirable before and after using Usearch -
http://www.drive5.com/usearch/intro.html
to select out a minimally homologous gene set of a species. RNA genes, cryptic viruses and SINE/LINE genes are among the undesirables to be eliminated.
Specifically, is there a command, using entret or its relatives, that accepts a list like
637008924
637008927
640691430
640691431
637008928
637008954
637008980
for extraction and repacking into a single smaller repository?

If not, could you recommend a software tool/suite for this type of job?

MarvS

From biopython at maubp.freeserve.co.uk Thu Feb 17 06:05:14 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 11:05:14 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To:
References:
Message-ID:

On Wed, Feb 16, 2011 at 8:54 PM, David Mathog wrote:
> Test case fasta file
> >8Achars
> AAAAAAAA
>
> all 6 frames for transeq, standard mode emits:
> >_1
> KKX
> >_2
> KKX
> >_3
> KK
> >_4
> FF
> >_5
> FFX
> >_6
> FFX
>

Note you can do that with a single command line:

$ transeq asis:AAAAAAAA -filter -frame 6
>asis_1
KKX
>asis_2
KKX
>asis_3
KK
>asis_4
FF
>asis_5
FFX
>asis_6
FFX

Note that while using 1, 2, 3 for the forward frames is well defined, there are two conventions for the reverse frame - do you start from the left or the right?
First let's just do the forward frames,

$ transeq asis:AAAAAAAA -filter -frame 1
>asis_1
KKX
$ transeq asis:AAAAAAAA -filter -frame 2
>asis_2
KKX
$ transeq asis:AAAAAAAA -filter -frame 3
>asis_3
KK

Are you happy with them?

Now let's do that with the reverse complement strand:

$ transeq asis:TTTTTTTT -filter -frame 1
>asis_1
FFX
$ transeq asis:TTTTTTTT -filter -frame 2
>asis_2
FFX
$ transeq asis:TTTTTTTT -filter -frame 3
>asis_3
FF

Now let's do that with the original sequence but the negative frames:

$ transeq asis:AAAAAAAA -filter -frame -3
>asis_6
FFX
$ transeq asis:AAAAAAAA -filter -frame -2
>asis_5
FFX
$ transeq asis:AAAAAAAA -filter -frame -1
>asis_4
FF

Same results - perhaps the naming isn't as you expected?

Peter

From oliver.liegmann at biologie.uni-freiburg.de Thu Feb 17 08:16:46 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Thu, 17 Feb 2011 14:16:46 +0100
Subject: [EMBOSS] seqret does not find sequence after update
In-Reply-To: <40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
References: <1296806018.12454.29.camel@yoda> <40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
Message-ID: <1297948606.14091.8.camel@yoda>

Hello,

thank you very much for your reply. dbifasta with the "C" locale and dbxfasta both seem to work well.

To answer your question: the operating system is Ubuntu 10.04 (Lucid) with the locale set to de_DE.utf8.

Best regards,
Oliver Liegmann

P.S.: I CC'ed this message to the list so that other users know about the workaround. Perhaps a short note about the locale issue should be added to the dbifasta documentation.

On Friday, 04.02.2011, at 19:34 +0000, ajb at ebi.ac.uk wrote:
> Hello,
>
> I could reproduce your problem. It appears to be a manifestation of the
> GNU sort "sorting order". If you, depending on your shell, do:
>
> export LC_ALL=C
>
> or
>
> setenv LC_ALL C
>
> and then re-index using dbifasta then retrieval should work as expected.
> Alternatively, use the dbx indexing system, which does not rely on
> GNU sort.
>
> Incidentally, what operating system and version are you using?
>
> HTH
>
> Alan Bleasby
> EBI

--
Dipl.-Inf.
Oliver Liegmann
AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg
+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html

MOSS 2011 - the annual meeting on bryophyte research
http://plantco.de/MOSS2011/

From mathog at caltech.edu Thu Feb 17 11:30:25 2011
From: mathog at caltech.edu (David Mathog)
Date: Thu, 17 Feb 2011 08:30:25 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID:

> Now let's do that with the reverse complement strand:
>
> $ transeq asis:TTTTTTTT -filter -frame 1
> >asis_1
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 2
> >asis_2
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 3
> >asis_3
> FF

That is the problem. Let me try to explain more clearly what the issue is.

AAAAAAAA Forward
TTTTTTTT Reverse
abc cba   <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX   4 XFF   EXPECTED
x^a^^b^^c^   phase 2   2 KKX   5 XFF   EXPECTED
xx^a^^b^     phase 3   3 KK    6 FF    EXPECTED

^a^^b^^c^    phase 1   1 KKX   4 FF    OBSERVED
x^a^^b^^c^   phase 2   2 KKX   5 FFX   OBSERVED
xx^a^^b^     phase 3   3 KK    6 FFX   OBSERVED

Assume an extra codon L to the left of a.

abc baL   <--- codons in diagram
^a^^b^^c^    phase 1   1 KKX   4 FF    EXPLAINED?
^^a^^b^^c^   phase 2   2 KKX   5 FFX   EXPLAINED?
L^^a^^b^     phase 3   3 KK    6 FFX   EXPLAINED?

That is, if the meaning of the + phases is to define the three codons a, b, c as shown in the diagram, such that the forward translation is as shown, then the reverse translation should be as shown above under EXPECTED. That is, it is the translation of the exact same set of codons done individually, but for the - strand, reverse complement each codon first and then invert the resulting translated sequence. That way the X, where it occurs, is attached to the same partial codon "c".

What I think is happening in transeq is that it is starting with the first full codon in the frame on the given strand. In effect that shifts the translated codons as shown in the "EXPLAINED?" section.
If partial codons were not translated then these would all be equivalent.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From biopython at maubp.freeserve.co.uk Thu Feb 17 12:03:15 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 17:03:15 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To:
References:
Message-ID:

On Thu, Feb 17, 2011 at 4:30 PM, David Mathog wrote:
>
>> Now let's do that with the reverse complement strand:
>>
>> $ transeq asis:TTTTTTTT -filter -frame 1
>> >asis_1
>> FFX

This is what I think that does (forward frames are easy):

Frame 1, so starts at first base:
Letters 123, codon TTT, gives F
Letters 456, codon TTT, gives F
Letters 78, partial codon TT-, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 2
>> >asis_2
>> FFX

Frame 2, so starts at second base:
Letter 1, just T, ignored
Letters 234, codon TTT, gives F
Letters 567, codon TTT, gives F
Letters 8, partial codon T--, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 3
>> >asis_3
>> FF

Frame 3, so starts at third base:
Letters 12, bases TT, ignored
Letters 345, codon TTT, gives F
Letters 678, codon TTT, gives F

> That is the problem. Let me try to explain more clearly what the issue is.
>
> That is, if the meaning of the + phases is to define the three codons
> a,b,c as shown in the diagram, such that the forward translation is as
> shown, then the reverse translation should be as shown above in
> expected. That is, it is the translation of the exact same set of
> codons done individually, but for the - strand reverse complement the
> codon first, and then invert the resulting translated sequence. That
> way the X, where it occurs is attached to the same partial codon "c".

I couldn't understand your diagram - probably font spacing issues in part.

The EMBOSS tool is doing all six frames; maybe all you need to work out is the mapping between its naming and yours.
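
One way to work out such a mapping is to write the convention down explicitly. The sketch below is an illustration inferred only from the outputs quoted in this thread, not taken from the transeq source: it assumes that reverse frame -k reuses the codon boundaries of forward frame k, skips any leading partial codon, and emits a trailing partial codon as X. Emitting X for every partial codon is a simplification.

```python
from itertools import product

# Standard genetic code, bases in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate_from(seq, offset):
    """Translate from a 0-based offset; unknown or partial codons give X."""
    protein = []
    for i in range(offset, len(seq), 3):
        codon = seq[i:i + 3]
        protein.append(CODON_TABLE.get(codon, "X") if len(codon) == 3 else "X")
    return "".join(protein)


def six_frames(seq):
    """Return a dict mapping frames 1, 2, 3, -1, -2, -3 to translations.

    Assumed convention: frame -k shares codon boundaries with frame k,
    so its start on the reverse complement depends on the sequence length.
    """
    seq = seq.upper()
    length = len(seq)
    rev = reverse_complement(seq)
    frames = {}
    for k in (1, 2, 3):
        frames[k] = translate_from(seq, k - 1)
        frames[-k] = translate_from(rev, (length - (k - 1)) % 3)
    return frames


if __name__ == "__main__":
    for frame, protein in six_frames("AAAAAAAA").items():
        print(frame, protein)
```

Under the other convention, where reverse frame k simply starts at base k of the reverse complement, the same translations appear under different frame numbers, which is the naming mismatch discussed above.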
Note that it can make sense to translate a trailing partial codon, e.g. TC... could be TCA, TCC, TCG or TCT which all code for S:

$ transeq asis:TCN -filter
>asis_1
S
$ transeq asis:TC -filter
>asis_1
S

Peter

From marvin.stodolsky at gmail.com Thu Feb 17 21:23:15 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Thu, 17 Feb 2011 21:23:15 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
References: <4D5A4068.4000302@ebi.ac.uk> <825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
Message-ID:

Sorry, here is the attachment.
The whole cleanup process could be done with only SED calls, I'm sure, but that would be beyond my SED comfort level.

MarvS

On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller wrote:
> HI Martin,
> I am interested in the solution. There was no attachment to the email I received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> All thanks for the suggestions. A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository. This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>>
>> Specifically, is there a command, using ENTRET or relatives, to accept a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job.
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>
>>>> This is elementary I'm sure, but I've been unable to work out the
>>>> syntax from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>> -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>> -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
-------------- next part --------------
With respect to using info in the FASTA description field, the intent and partial solution can now be explained.

The top level intent is to avoid overlapping genes in a statistical analysis being planned. The 3rd & 4th lines below are from an "infoseq -nocolumns" whole genome retrieval.
They report an overlap, i.e., the DNA gyrase A is overlapped by seryl-tRNA:
seryl_begin - gyrase_end = 7294 - 7322 < 0

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]

In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the proteins/codon_usage statistical analysis being planned. Thus I wished for an en masse way of recognizing the overlapping genes.

A non-elegant fix has been worked out. Pulling the dataset into a spreadsheet, spaces in the description field were next replaced with >< :
DnaJ><1828..2760(+)><[Mycoplasma><2845..4797(+)><[Mycoplasma><4812..7322(+)><[Mycoplasma><7294..8547(+)><[Mycoplasma><[Mycoplasma><[
is replaced by "to be field separator" |[
DNA><686..1828(+)|[Mycoplasma><1828..2760(+)|[Mycoplasma><2845..4797(+)|[Mycoplasma><4812..7322(+)|[Mycoplasma><7294..8547(+)|[Mycoplasma><
in the terminal common [Mycoplasma> Myc637000176m2.csv
resulting in :
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

Internals are next mostly deleted with:
sed -e 's/<.*>//g' Myc637000176m2.csv > Myc637000176m3.csv
resulting in:
DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

The single remaining >< is replaced with potential separator |
sed -e 's/> Myc637000176m4.csv
resulting in:
DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|

BASICALLY, the clever work is now done, and the rest is more routine manipulation.
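
For comparison, the field extraction and the neighbour overlap test described in this note can be sketched in one pass. This is only an illustration, not the procedure used here: it assumes every description line has the exact shape of the five lines shown in this note, the function names are made up, and it uses the same criterion as the spreadsheet (next Start minus this End below zero), so it only compares adjacent genes.

```python
import re

# Description lines as printed in this note.
LINES = """\
DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]
""".splitlines()

# name, start..end(strand), [organism] ; assumed layout of each line
PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<start>\d+)\.\.(?P<end>\d+)"
    r"\((?P<strand>[+-])\)\s+\[(?P<organism>.+)\]$"
)


def parse(line):
    """Split one description line into name, start, end and strand."""
    m = PATTERN.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised description: {line!r}")
    return {"name": m["name"], "start": int(m["start"]),
            "end": int(m["end"]), "strand": m["strand"]}


def flag_overlaps(lines):
    """Return (gene, overlaps_a_neighbour) pairs, sorted by start position."""
    genes = sorted((parse(line) for line in lines), key=lambda g: g["start"])
    flags = [False] * len(genes)
    for i in range(len(genes) - 1):
        gap = genes[i + 1]["start"] - genes[i]["end"]
        if gap < 0:  # the next gene starts before this one ends
            flags[i] = flags[i + 1] = True
    return list(zip(genes, flags))


for gene, overlapping in flag_overlaps(LINES):
    print(int(overlapping), gene["start"], gene["end"], gene["name"], sep="\t")
```

With the five lines above, only DNA gyrase subunit A and seryl-tRNA synthetase are flagged with 1, which matches the 7294 - 7322 < 0 case.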
A cleanup was done with:
sed -e 's/)|//g' Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g' Myc637000176m5.csv > Myc637000176m6.csv
together changing the (+)| to |+ , that is, a separated field.

The replacement of the residual .. with potential separator | was easiest done as a within-spreadsheet operation in its own field, because of too many residual "." in the whole file.

After routine manipulations within the spreadsheet, a view of the overlap detection section is:

F       G       H               I                   J                                     fields
Start   End     Begin-nextEnd   OR((H2<0),(H1<0))   Stable 0/1 Value, for SORTING on
686     1828    0               FALSE               0
1828    2760    85              FALSE               0
2845    4797    15              TRUE                0
4812    7322    -28             TRUE                1
7294    8547    4               TRUE                1
8551    9183    -27             TRUE                1
9156    9920    3               FALSE               0
9923    11251   0               FALSE               0

The overlapping genes have the stable value 1 during sorting, while the FALSE/TRUE values in field I are not stable during sorting.

From egorleg at gmail.com Tue Feb 22 19:56:40 2011
From: egorleg at gmail.com (Kevin Egan)
Date: Wed, 23 Feb 2011 00:56:40 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To:
References:
Message-ID:

Hi

I was wondering is there anywhere I could find the source code for dot-matcher?

From uludag at ebi.ac.uk Wed Feb 23 04:50:28 2011
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Wed, 23 Feb 2011 09:50:28 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To:
References:
Message-ID: <1298454628.8626.3.camel@emboss1.ebi.ac.uk>

Hi Kevin,

> I was wondering is there anywhere I could find the source code for
> dot-matcher?

EMBOSS release tarballs include source files for EMBOSS applications and for EMBOSS libraries.

ftp://emboss.open-bio.org/pub/EMBOSS/

Regards,
Mahmut

From jmeador45 at mac.com Tue Feb 1 03:43:50 2011
From: jmeador45 at mac.com (Jim Meador)
Date: Mon, 31 Jan 2011 22:43:50 -0500
Subject: [EMBOSS] can't install emboss 6.3.1 in mac os x 10.6
In-Reply-To:
References:
Message-ID: <2D54C5BB-C798-43C5-8EBA-2CCC9DEEC740@mac.com>

Hi Iain,

I think you have the right idea.
I had installed XQuartz which actually fixed some other issues I was having (but created new ones ;-) and I think I need to re-configure with x11=/opt/x11 or something like that. So I think you have the answer and thank you for the help. Sincerely, Jim (just around the corner in Cambridge) On Jan 31, 2011, at 11:05 AM, Iain Drummond wrote: > I seem to remember that i could bypass this issue by configuring EMBOSS to > install without x11. I realize that's not what you want to do. I think the > problem stems from the directory structure Apple uses for x11; i.e. Its not > where EMBOSS thinks it is. Maybe Apple put xll in a new place with 10.6? > > Iain Drummond > > > On 1/30/11 11:49 PM, "Jim Meador" wrote: > >> Hi Everyone, >> >> I am wanting to pgrade EMBOSS 6.3.1 on Mac OS X 10.6.6 (MacBook Pro 2.2 GHz, >> 4GB) from EMBOSS 6.2.0 that was installed in Leopoard (10.5) before I upgraded >> to Snow Leopard (10.6) and it seems to not do the make process correctly, >> where in the plplot directory, the .libs directory does not get created, so >> when I do sudo make install, it fails with these error messages: >> >> Making install in plplot >> make[1]: Entering directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot' >> Making install in lib >> make[2]: Entering directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot/lib' >> make[3]: Entering directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot/lib' >> make[3]: Nothing to be done for `install-exec-am'. 
>> test -z "/usr/local/share/EMBOSS" || ../.././install-sh -c -d >> "/usr/local/share/EMBOSS" >> /usr/bin/install -c -m 644 plstnd5.fnt plxtnd5.fnt '/usr/local/share/EMBOSS' >> make[3]: Leaving directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot/lib' >> make[2]: Leaving directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot/lib' >> make[2]: Entering directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot' >> make[3]: Entering directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot' >> test -z "/usr/local/lib" || .././install-sh -c -d "/usr/local/lib" >> /bin/sh ../libtool --mode=install /usr/bin/install -c libeplplot.la >> '/usr/local/lib' >> libtool: install: /usr/bin/install -c .libs/libeplplot.3.dylib >> /usr/local/lib/libeplplot.3.dylib >> install: .libs/libeplplot.3.dylib: No such file or directory >> make[3]: *** [install-libLTLIBRARIES] Error 71 >> make[3]: Leaving directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot' >> make[2]: *** [install-am] Error 2 >> make[2]: Leaving directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot' >> make[1]: *** [install-recursive] Error 1 >> make[1]: Leaving directory >> `/Users/jmeador/newApps/MolBioChemSoftware/MolBioGeneral/EMBOSS/EMBOSS-6.3.1/p >> lplot' >> make: *** [install-recursive] Error 1 >> >> When I look in the source directory, /EMBOSS-6.3.1/plplot/ there is no .libs >> directory as there is in another installation on a newer MacBook Pro that I >> was able to successfully install this same software (10.6.6, 2.6 GHz, 8GB). 
>> >> My setup is a little more complicated than I would like, since I have >> installed the eBiotools-3.0.1-leopard software which sets up an older version >> of emboss 5.0.0 within a /usr/ebiotools/ directory and uses a very nice gui >> program to access these older emboss programs, called "eBioX" (and I don't >> want to lose this). So I have been installing the newer 6.x version of emboss >> to use at the commandline and with kemboss, both of which work, sort of. I >> have to play with the environment variables to get the text output programs to >> work, but I cannot get the graphics to work from various emboss programs that >> try to make graphs, such as "charge". The text-based programs work but I want >> to get the graphics working as well as it does from the ebiotools versions and >> will probably need to re-make emboss 6.3.1 after (re)installing gd and libpng. >> However, on this mac, neither ps nor x11 work. On the newer mac, with 6.3.1 >> installed, I don't have png or gif support, but I can at least get g! >> raphs in ps and x11 to work. >> >> Does anyone have any ideas of what I may be doing wrong? Is it possible that >> some environment variable could be causing this? >> >> Any ideas will be greatly appreciated. >> Thanks, >> Jim >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss >> > > > > > The information in this e-mail is intended only for the person to whom it is > addressed. If you believe this e-mail was sent to you in error and the e-mail > contains patient information, please contact the Partners Compliance HelpLine at > http://www.partners.org/complianceline . If the e-mail was sent to you in error > but does not contain patient information, please contact the sender and properly > dispose of the e-mail. 
> From wo.granon at gmail.com Wed Feb 2 19:02:46 2011 From: wo.granon at gmail.com (Wolfgang Gruber) Date: Wed, 2 Feb 2011 20:02:46 +0100 Subject: [EMBOSS] Mistake in Appdoc Edialign? Message-ID: Hello, I studied the papers for DIALIGN and only in the newest Version DIALIGN-TX (Subramanian u. a., 2008) I can find the information that DIALIGN uses a guide tree. In the appdoc to edialign I read that emboss uses DIALIGN2. In this Publikation (Morgenstern, 1999) I cannot find an information that a guide tree is used. Also in the original DIALIGN2 documentation I read: "This tree is constructed by applying the UPGMA clustering method to the DIALIGN similarity scores." but nothing that this tree is used for guiding. So is this information in the emboss appdoc incorrect? At all: is there a plan to update to DIALIGN-TX? Thanks, Wolfgang Morgenstern, B.: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. In: Bioinformatics (Oxford, England) Bd.?15 (1999), Nr.?3, S.?211-218. ??PMID: 10222408 Subramanian, A. ; Kaufmann, M. ; Morgenstern, B.: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. 
In: Algorithms for Molecular Biology Bd.?3 (2008), Nr.?1, S.?6 From oliver.liegmann at biologie.uni-freiburg.de Fri Feb 4 07:53:38 2011 From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann) Date: Fri, 04 Feb 2011 08:53:38 +0100 Subject: [EMBOSS] seqret does not find sequence after update Message-ID: <1296806018.12454.29.camel@yoda> Dear list members, does some of you also got this problem (and probably has an idea on what's going wrong): After upgrading from version 6.2.0 to 6.3.1 seqret does not work properly anymore: First, Emboss was installed using ./configure --enable-64 --prefix=/opt/emboss make make install The database was set up with: dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas Using seqret to retrieve the sequences produces an error: seqret plafa_test:PLAFA_MAL13P1.23-b Reads and writes (returns) sequences output sequence(s) [plafa_mal13p1.fasta]: Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a' Only the remaining two sequences are stored in the output file. Are the allowed characters used in the accession changed? With Emboss 6.2.0 we did not have any problems, but after upgrade a huge bunch of sequences could not be retrieved anymore when used with our internal fasta database, although the output in outfile.dbifasta shows all sequences to be inserted into the database. The content of the different files are: PLAFA_test.fas: >PLAFA_MAL13P1.23-b MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH >PLAFA_MAL13P1.237a MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL >PLAFA_MAL13P1.23-a MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH test.txt: plafa:PLAFA_MAL13P1.23-b plafa:PLAFA_MAL13P1.237a plafa:PLAFA_MAL13P1.23-a emboss.default: DB plafa [ format: fasta method: emblcd directory: /home/liegmann/genomezoo/emboss/prob/test/db type: P ] Best regards, Oliver Liegmann -- Dipl.-Inf. 
Oliver Liegmann
AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg
+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: outfile.dbifasta
Type: application/octet-stream
Size: 848 bytes
Desc: not available
URL:

From Caroline.Barretto at rdls.nestle.com Tue Feb 8 10:01:28 2011
From: Caroline.Barretto at rdls.nestle.com (Barretto, Caroline, LAUSANNE, BioInformatics)
Date: Tue, 8 Feb 2011 11:01:28 +0100
Subject: [EMBOSS] diffseq memory problem?
Message-ID:

Dear EMBOSS developers,

I have been using diffseq to compare two strains of the same bacterial species using "10" as wordsize without any problem.

However, when I try to reduce this number to "4", after several hours of calculation the server collapses; all RAM and SWAP are used.

Is there any option to avoid that, or do you know if someone is working on that problem?

Many thanks,
Best regards,
Caroline.

From pmr at ebi.ac.uk Tue Feb 8 10:46:32 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 08 Feb 2011 10:46:32 +0000
Subject: [EMBOSS] diffseq memory problem?
In-Reply-To:
References:
Message-ID: <4D511F08.7010206@ebi.ac.uk>

Dear Caroline,

On 08/02/2011 10:01, Barretto, Caroline, LAUSANNE, BioInformatics wrote:
> Dear EMBOSS developers,
>
> I have been using diffseq to compare too strains of the same bacteria
> species using "10" as wordsize without any problem.
>
> However, when I try to reduce this number to "4", after several hours of
> calculation the server collapses, all RAM and SWAP are used.
>
> Is there any option to avoid that, or do you know if someone is working
> on that problem?

Depending on the input size, and the number of simple repeats, a low word size could easily generate too many matches for large sequence lengths.

We would recommend reducing the word size more slowly (maybe 10, 8, 6).
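
A rough way to see why each step down in word size costs so much is to count chance matches only. The sketch below assumes random sequence with four equally frequent bases, which real genomes with simple repeats will exceed, and the function name is made up for illustration. Under that assumption the number of chance exact word matches grows by a factor of 4 for every base removed from the word.

```python
def chance_match_factor(word_size, baseline=10, alphabet=4):
    """Growth in chance exact word matches when the word size drops from baseline.

    Assumes random sequence with equally frequent letters.
    """
    return alphabet ** (baseline - word_size)


for w in (10, 8, 6, 4):
    print(w, chance_match_factor(w))
# 10 1
# 8 16
# 6 256
# 4 4096
```

So going straight from 10 to 4 multiplies the chance matches roughly 4096 times, before any repeats in the real sequences are counted.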
As a guideline, finding more matches than there are non-overlapping words in the sequence is unlikely to be useful and is a reasonable point to stop reducing the word size.

Meanwhile, we will take a look at diffseq in case there is some way to improve its performance or to warn at an early stage if the word size appears small for the input sequence lengths and may generate too many matches.

Hope this helps

Peter Rice
EMBOSS Team

From WulfDirk.Leuschner at sanofi-aventis.com Thu Feb 10 07:43:17 2011
From: WulfDirk.Leuschner at sanofi-aventis.com (WulfDirk.Leuschner at sanofi-aventis.com)
Date: Thu, 10 Feb 2011 08:43:17 +0100
Subject: [EMBOSS] lit. references for EMBOSS data files, e.g. Epk.dat (iep usage)
Message-ID: <650F10565E484347B51CF6679663A80402F51F19@ffpw10.f2.enterprise>

Hi all,

I was wondering whether someone might know something about how some of the meta data used in EMBOSS were compiled. A colleague of mine was looking for a reference for the Epk.dat values used for the determination of the isoelectric point of a protein. However, neither she nor I could find anything... Any hints?

Wulf Dirk Leuschner

From jison at ebi.ac.uk Thu Feb 10 16:42:34 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 10 Feb 2011 16:42:34 -0000 (UTC)
Subject: [EMBOSS] dreg: does it search both strands?
In-Reply-To: <4D41F992.5030900@dartmouth.edu>
References: <4D41F992.5030900@dartmouth.edu>
Message-ID: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk>

Hi Lionel

Didn't see a reply to you, sorry.

Anyhow, dreg will search the sequence as given. This is taken as the sense/coding/+ strand.

If you specify -sreverse (which is available to any applications that read sequences) it will I think search the reverse complement of that sequence instead.

Cheers

Jon

> Hello fellow EMBOSS fans,
>
> I am using the dreg program to search the human genome for my favorite
> motif. I was unable to find any information regarding the meaning of
> the strand information in the output.
Does dreg search both strands or > will it always return "+" as the strand designation of the hits that it > finds? > > Thanks for your continued support and development of this fantastic tool! > > Sincerely, > Lionel "Lee" Brooks 3rd > Dartmouth Genetics Grad Student > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From sigve.nakken at medisin.uio.no Fri Feb 11 10:54:44 2011 From: sigve.nakken at medisin.uio.no (Sigve Nakken) Date: Fri, 11 Feb 2011 11:54:44 +0100 Subject: [EMBOSS] DNA sequence as input argument Message-ID: <4D551574.2080004@medisin.uio.no> Hi, Is there any way in which one can read a DNA sequence directly from the command line (that is as a string input argument) rather than from a file? I am especially interested in finding repeats, inverted repeats etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead of creating a FASTA file for each query sequene, I would like to read the sequence directly from the command line. Is this possible? Kind regards, Sigve From pmr at ebi.ac.uk Fri Feb 11 11:11:13 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 11 Feb 2011 11:11:13 +0000 Subject: [EMBOSS] DNA sequence as input argument In-Reply-To: <4D551574.2080004@medisin.uio.no> References: <4D551574.2080004@medisin.uio.no> Message-ID: <4D551951.7030406@ebi.ac.uk> Dear Sigve, On 11/02/2011 10:54, Sigve Nakken wrote: > Hi, > > Is there any way in which one can read a DNA sequence directly from the > command line (that is > as a string input argument) rather than from a file? I am especially > interested in finding repeats, > inverted repeats etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead > of creating a FASTA file > for each query sequene, I would like to read the sequence directly from > the command line. Is this possible? seqret asis::ctgatcgatgctagctgac the "asis" format was included exactly for this purpose. 
You do need to take care that a long sequence is not too long for your shell to handle on the command line (a shell issue, not an EMBOSS issue). You can also add to the command line: -sid abc123 This will give it an ID of abc123 and the output file will default to (for seqret) abc123.fasta and will have the abc123 identifier in it. Hope this helps Peter Rice EMBOSS Team From stephen.taylor at imm.ox.ac.uk Fri Feb 11 11:13:45 2011 From: stephen.taylor at imm.ox.ac.uk (Steve Taylor) Date: Fri, 11 Feb 2011 11:13:45 +0000 Subject: [EMBOSS] DNA sequence as input argument In-Reply-To: <4D551574.2080004@medisin.uio.no> References: <4D551574.2080004@medisin.uio.no> Message-ID: <4D5519E9.1050401@imm.ox.ac.uk> Hi, > > Is there any way in which one can read a DNA sequence directly from the > command line (that is > as a string input argument) rather than from a file? I am especially > interested in finding repeats, > inverted repeats etc. (e.g. 'einverted', 'etandem' EMBOSS apps). Instead > of creating a FASTA file > for each query sequene, I would like to read the sequence directly from > the command line. Is this possible? > From http://emboss.sourceforge.net/docs/faq.html A) The "filename" is really the sequence. This is a quick and easy way of reading in a short fragment of sequence without having to enter it into a file. For example: % program -seq asis::ATGGTGAGGAGAGTTGTGATGAGA Steve From Lionel.Brooks at dartmouth.edu Fri Feb 11 18:22:38 2011 From: Lionel.Brooks at dartmouth.edu (Lionel (Lee) Brooks 3rd) Date: Fri, 11 Feb 2011 13:22:38 -0500 Subject: [EMBOSS] dreg: does it search both strands? In-Reply-To: <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> Message-ID: <4D557E6E.3020608@dartmouth.edu> Hi Jon, Thank you! Apparently, I should have rtfm more than once. Sincerely, Lionel Jon Ison wrote: > Hi Lionel > > Didn't see a reply to you, sorry. 
> > Anyhow, dreg will search the sequence as given. This is taken as the sense/coding/+ strand. > > If you specify -sreverse (which is available to any applications that read sequences) it will I > think search the reverse complement of that sequence instead. > > Cheers > > Jon > > > > >> Hello fellow EMBOSS fans, >> >> I am using the dreg program to search the human genome for my favorite >> motif. I was unable to find any information regarding the meaning of >> the strand information in the output. Does dreg search both strands or >> will it always return "+" as the strand designation of the hits that it >> finds? >> >> Thanks for your continued support and development of this fantastic tool! >> >> Sincerely, >> Lionel "Lee" Brooks 3rd >> Dartmouth Genetics Grad Student >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss >> >> > > > From pmr at ebi.ac.uk Fri Feb 11 18:46:22 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 11 Feb 2011 18:46:22 +0000 Subject: [EMBOSS] dreg: does it search both strands? In-Reply-To: <4D557E6E.3020608@dartmouth.edu> References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> <4D557E6E.3020608@dartmouth.edu> Message-ID: <4D5583FE.1060703@ebi.ac.uk> Dear Lee, On 11/02/2011 18:22, Lionel (Lee) Brooks 3rd wrote: > Hi Jon, > > Thank you! Apparently, I should have rtfm more than once. True ... but it not obvious which part to (re-)read. We could make it easier. Perhaps the "Input sequence" could point to the sequence qualifiers and the USA syntax. We will look at improving this part of the documentation in the next release. regards, Peter Rice EMBOSS Team From db60 at st-andrews.ac.uk Sat Feb 12 12:07:03 2011 From: db60 at st-andrews.ac.uk (Daniel Barker) Date: Sat, 12 Feb 2011 12:07:03 +0000 Subject: [EMBOSS] dreg: does it search both strands? 
In-Reply-To: <4D5583FE.1060703@ebi.ac.uk> References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> <4D557E6E.3020608@dartmouth.edu> <4D5583FE.1060703@ebi.ac.uk> Message-ID: <4D5677E7.1080402@st-andrews.ac.uk> Dear Peter, A lot of the time for nucleotide stuff it makes sense to search both strands. Of course, it isn't hard to search one strand, then the other. But this introduces an extra step. I wonder if there could be some convenient option to do this, and if it should perhaps be the default? (As with NCBI blastall with any kind of nucleotide search.) This would affect programs beyond just dreg and, though it would be OK for our work, perhaps it wouldn't make sense for others. Just a thought. Best regards, Daniel -- Daniel Barker http://bio.st-andrews.ac.uk/staff/db60.htm The University of St Andrews is a charity registered in Scotland : No SC013532 From pmr at ebi.ac.uk Sat Feb 12 16:58:32 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Sat, 12 Feb 2011 16:58:32 +0000 Subject: [EMBOSS] dreg: does it search both strands? In-Reply-To: <4D5677E7.1080402@st-andrews.ac.uk> References: <4D41F992.5030900@dartmouth.edu> <46070.172.22.100.208.1297356154.squirrel@webmail.ebi.ac.uk> <4D557E6E.3020608@dartmouth.edu> <4D5583FE.1060703@ebi.ac.uk> <4D5677E7.1080402@st-andrews.ac.uk> Message-ID: <4D56BC38.9070408@ebi.ac.uk> Dear Daniel, On 12/02/2011 12:07, Daniel Barker wrote: > A lot of the time for nucleotide stuff it makes sense to search both > strands. Of course, it isn't hard to search one strand, then the other. > But this introduces an extra step. I wonder if there could be some > convenient option to do this, and if it should perhaps be the default? > (As with NCBI blastall with any kind of nucleotide search.) > > This would affect programs beyond just dreg and, though it would be OK > for our work, perhaps it wouldn't make sense for others. Just a thought. Interesting suggestion. 
Maybe we can add a -bothstrands option for applications to search the forward and reverse strands. We need to consider: * Do the results make sense? * What default do we set (maybe some programs have a different default)? * Is this complicated for programs that can use DNA or protein input? * Can we apply it to applications aligning two sequences? meanwhile, running twice with -sreverse the second time will find you all the matches. regards, Peter Rice EMBOSS Team From david.bauer at bayer.com Mon Feb 14 07:36:46 2011 From: david.bauer at bayer.com (david.bauer at bayer.com) Date: Mon, 14 Feb 2011 08:36:46 +0100 Subject: [EMBOSS] dreg: does it search both strands? In-Reply-To: <4D56BC38.9070408@ebi.ac.uk> Message-ID: Hi Daniel & Peter, emboss-bounces at lists.open-bio.org schrieb am 12/02/2011 17:58:32: > Dear Daniel, > > On 12/02/2011 12:07, Daniel Barker wrote: > > A lot of the time for nucleotide stuff it makes sense to search both > > strands. Of course, it isn't hard to search one strand, then the other. > > But this introduces an extra step. I wonder if there could be some > > convenient option to do this, and if it should perhaps be the default? > > (As with NCBI blastall with any kind of nucleotide search.) > > > > This would affect programs beyond just dreg and, though it would be OK > > for our work, perhaps it wouldn't make sense for others. Just a thought. > I think another candidate would be fuzznuc. (That's at least the program, where I sometimes missed this option ;-) > Interesting suggestion. > > Maybe we can add a -bothstrands option for applications to search the > forward and reverse strands. Yes, this would add the new functionality without breaking the old default behaviour of the programs. > We need to consider: > * Do the results make sense? > * What default do we set (maybe some programs have a different default)? As mentioned above, I would not touch the old default settings and add searching both strands as an option. (e.g. 
in stssearch the search of both strands is already the default.)

> * Is this complicated for programs that can use DNA or protein input?
> * Can we apply it to applications aligning two sequences?

I think it could make sense for programs which can align one sequence against a set of other sequences (e.g. water, needle).

Regards,
David.

From marvin.stodolsky at gmail.com Mon Feb 14 23:35:45 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Mon, 14 Feb 2011 18:35:45 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To:
References:
Message-ID:

This is elementary I'm sure, but I've been unable to work out the syntax from the documentation.
More minor issue.

When using infoseq to extract all the fasta Headers from a sequence Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to come as a uniform field/fields in a resultant spreadsheet. Is there a Fix for this?

MarvS

From pmr at ebi.ac.uk Tue Feb 15 08:59:20 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 15 Feb 2011 08:59:20 +0000
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To:
References:
Message-ID: <4D5A4068.4000302@ebi.ac.uk>

On 14/02/2011 23:35, Marvin Stodolsky wrote:
> This is elementary I'm sure, but I've been unable to work out the
> syntax from the documentation.
> More minor issue.
>
> When using infoseq to extract all the fasta Headers from a sequence
> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
> for this?

I don't see the genebegin and geneend in EMBOSS infoseq output. Are they part of the sequence ID in the FASTA file?

You can use a delimiter between items for infoseq using:

-nocolumn

on the command line.

For import into a spreadsheet you can set the delimiter to be tab with:

-nocolumn -delimiter "\t"

on the command line. That should then import nicely into a spreadsheet.
Hope that helps

Peter Rice
EMBOSS Team

From mathog at caltech.edu Wed Feb 16 20:54:05 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 12:54:05 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID:

Test case fasta file

>8Achars
AAAAAAAA

All 6 frames for transeq, standard mode, emits:

>_1
KKX
>_2
KKX
>_3
KK
>_4
FF
>_5
FFX
>_6
FFX

But...

AAAAAAAA  Forward
TTTTTTTT  Reverse
                       abc     cba   <--- codons in diagram
^a^^b^^c^   phase 1  1 KKX   4 XFF
x^a^^b^^c^  phase 2  2 KKX   5 XFF
xx^a^^b^    phase 3  3 KK    6 FF

That is, if frames 4->6 are supposed to use, respectively, the same set of codons as 1->3, but translate on the opposite strand, shouldn't the number of residues returned be like in the table above, also with the X at the beginning rather than the end? Or to put it another way, shouldn't the little "x" bases be ignored on the - strand if they were also ignored on the +?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From mathog at caltech.edu Wed Feb 16 22:47:15 2011
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Feb 2011 14:47:15 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID:

Here is another worked example with a small but real mRNA fragment. (Best cut and paste it into a program with a fixed width font).
Test sequence:

>for (AKA gi|1728|emb|V00893.1, this is "+" direction)
TCGAAAACCGGGCCATGAAGGATGAGGAGAAGATGGAGCTGCA
GGAGATGCAGCTGAAGGAGGCCAAGCACATTGCCGAGGACTCA
GACCGCAAATACGAGGAGGTGGCCAGGAAGCTGGTGATCCTCGA
>rev (for reversed)
TCGAGGATCACCAGCTTCCTGGCCACCTCCTCGTATTTGCGGT
CTGAGTCCTCGGCAATGTGCTTGGCCTCCTTCAGCTGCATCTC
CTGCAGCTCCATCTTCTCCTCATCCTTCATGGCCCGGTTTTCGA

Transeq output, all 6 frames, for >for and >rev

>for_1
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>for_2
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>for_3
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>for_4
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>for_5
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>for_6
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_1
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>rev_2
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>rev_3
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>rev_4
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>rev_5
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>rev_6
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX

Output from a different program, all 12 frame options shown on the
fasta header line as:

  phase(strand)

Positive phases are measured from sequence position 1. Negative phases
are measured from sequence position N, the last base in the sequence.
This program differs from transeq in that any partial codon is emitted
as an X. Note how transeq output never starts with an X, whereas here
the X maintains its position on the nucleic acid sequence, for
instance, +1(+) and +1(-).
>gi|1728|emb|V00893.1|[+1(+)]
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SSX
>gi|1728|emb|V00893.1|[+2(+)]
RKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[+3(+)]
ENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVILX
>gi|1728|emb|V00893.1|[+1(-)]
XRGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[+2(-)]
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFS
>gi|1728|emb|V00893.1|[+3(-)]
XEDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVF
>gi|1728|emb|V00893.1|[-1(-)]
SRITSFLATSSYLRSESSAMCLASFSCISCSSIFSSSFMARFSX
>gi|1728|emb|V00893.1|[-2(-)]
RGSPASWPPPRICGLSPRQCAWPPSAASPAAPSSPHPSWPGFR
>gi|1728|emb|V00893.1|[-3(-)]
EDHQLPGHLLVFAV*VLGNVLGLLQLHLLQLHLLLILHGPVFX
>gi|1728|emb|V00893.1|[-1(+)]
XRKPGHEG*GEDGAAGDAAEGGQAHCRGLRPQIRGGGQEAGDPR
>gi|1728|emb|V00893.1|[-2(+)]
SKTGP*RMRRRWSCRRCS*RRPSTLPRTQTANTRRWPGSW*SS
>gi|1728|emb|V00893.1|[-3(+)]
XENRAMKDEEKMELQEMQLKEAKHIAEDSDRKYEEVARKLVIL
>gi|1728|emb|V00893.1|

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From marvin.stodolsky at gmail.com Thu Feb 17 02:07:51 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Wed, 16 Feb 2011 21:07:51 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <4D5A4068.4000302@ebi.ac.uk>
References: <4D5A4068.4000302@ebi.ac.uk>
Message-ID:

All thanks for the suggestions. A solution to the GeneBegin..GeneEnd
problem has been worked out, per the Attachment, for those interested.

But for me the more important problem is making a FASTA repository
which is a subset of the gene files in a much larger Repository. This
is desirable before & after using Usearch -
http://www.drive5.com/usearch/intro.html
to select out a minimally homologous gene set of a species.
RNA genes, cryptic viruses and SINE/LINE genes are among the
undesirables to be eliminated.
Specifically, is there a command, using ENTRET or relatives, that
accepts a list like
637008924
637008927
640691430
640691431
637008928
637008954
637008980
for extraction and repacking into a single smaller Repository?

If not, could you recommend a software tool/suite for this type of job.

MarvS

On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice wrote:
> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>
>> This is elementary I'm sure, but I've been unable to work out the
>> syntax from the documentation.
>> More minor issue.
>>
>> When using infoseq to extract all the fasta Headers from a sequence
>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
>> for this?
>
> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
> part of the sequence ID in the FASTA file?
>
> You can use a delimiter between items for infoseq using:
>
>  -nocolumn
>
> on the command line.
>
> For import into a spreadsheet you can set the delimiter to be tab with:
>
>  -nocolumn -delimiter "\t"
>
> on the command line. That should then import nicely into a spreadsheet.
>
> Hope that helps
>
> Peter Rice
> EMBOSS Team
>

From biopython at maubp.freeserve.co.uk Thu Feb 17 11:05:14 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 11:05:14 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To:
References:
Message-ID:

On Wed, Feb 16, 2011 at 8:54 PM, David Mathog wrote:
> Test case fasta file
>>8Achars
> AAAAAAAA
>
> all 6 frames for transeq, standard mode emits:
>>_1
> KKX
>>_2
> KKX
>>_3
> KK
>>_4
> FF
>>_5
> FFX
>>_6
> FFX
>

Note you can do that with a single command line:

$ transeq asis:AAAAAAAA -filter -frame 6
>asis_1
KKX
>asis_2
KKX
>asis_3
KK
>asis_4
FF
>asis_5
FFX
>asis_6
FFX

Note that while using 1, 2, 3 for the forward frames is well defined,
there are two conventions for the reverse frame - do you start from
the left or the right?
First let's just do the forward frames,

$ transeq asis:AAAAAAAA -filter -frame 1
>asis_1
KKX
$ transeq asis:AAAAAAAA -filter -frame 2
>asis_2
KKX
$ transeq asis:AAAAAAAA -filter -frame 3
>asis_3
KK

Are you happy with them? Now let's do that with the reverse
complement strand:

$ transeq asis:TTTTTTTT -filter -frame 1
>asis_1
FFX
$ transeq asis:TTTTTTTT -filter -frame 2
>asis_2
FFX
$ transeq asis:TTTTTTTT -filter -frame 3
>asis_3
FF

Now let's do that with the original sequence but the negative frames:

$ transeq asis:AAAAAAAA -filter -frame -3
>asis_6
FFX
$ transeq asis:AAAAAAAA -filter -frame -2
>asis_5
FFX
$ transeq asis:AAAAAAAA -filter -frame -1
>asis_4
FF

Same results - perhaps the naming isn't as you expected?

Peter

From oliver.liegmann at biologie.uni-freiburg.de Thu Feb 17 13:16:46 2011
From: oliver.liegmann at biologie.uni-freiburg.de (Oliver Liegmann)
Date: Thu, 17 Feb 2011 14:16:46 +0100
Subject: [EMBOSS] seqret does not find sequence after update
In-Reply-To: <40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
References: <1296806018.12454.29.camel@yoda> <40534.86.26.12.63.1296848044.squirrel@webmail.ebi.ac.uk>
Message-ID: <1297948606.14091.8.camel@yoda>

Hello,

thank you very much for your reply. dbifasta with the "C" locale and
dbxfasta both seem to work well.

To answer your question: The operating system is Ubuntu 10.04 (Lucid)
with locales set to de_DE.utf8.

Best regards,
Oliver Liegmann

P.S.: I CC'ed this message to the list, for other users to know about
the workaround. Probably a short note should be written into the
documentation of dbifasta about the locales issue.

On Friday, 04.02.2011, at 19:34 +0000, ajb at ebi.ac.uk wrote:
> Hello,
>
> I could reproduce your problem. It appears to be a manifestation of the
> GNU sort "sorting order". If you, depending on your shell, do:
>
> export LC_ALL=C
>
> or
>
> setenv LC_ALL C
>
> and then re-index using dbifasta then retrieval should work as expected.
> Alternatively use the dbx indexing system, which does not rely on
> GNU sort.
>
> Incidentally, what operating system and version are you using?
>
> HTH
>
> Alan Bleasby
> EBI
>
>
> > Dear list members,
> >
> > does some of you also got this problem (and probably has an idea on what's
> > going wrong):
> >
> > After upgrading from version 6.2.0 to 6.3.1 seqret does not work
> > properly anymore:
> > First, Emboss was installed using
> > ./configure --enable-64 --prefix=/opt/emboss
> > make
> > make install
> >
> > The database was set up with:
> > dbifasta -dbname plafa -idformat simple -filenames PLAFA_test.fas
> >
> > Using seqret to retrieve the sequences produces an error:
> > seqret plafa_test:PLAFA_MAL13P1.23-b
> > Reads and writes (returns) sequences
> > output sequence(s) [plafa_mal13p1.fasta]:
> > Error: Failed to read sequence 'plafa:PLAFA_MAL13P1.237a'
> >
> > Only the remaining two sequences are stored in the output file.
> >
> > Are the allowed characters used in the accession changed? With Emboss
> > 6.2.0 we did not have any problems, but after upgrade a huge bunch of
> > sequences could not be retrieved anymore when used with our internal fasta
> > database, although the output in
> > outfile.dbifasta shows all sequences to be inserted into the database.
> >
> >
> > The content of the different files are:
> > PLAFA_test.fas:
> >>PLAFA_MAL13P1.23-b
> > MLTCFLFYIYEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
> >>PLAFA_MAL13P1.237a
> > MKNTFFFVLSFFLYITILDITLTSLIQKNILKEKVDKEYMKVFLFVNNSQKYCEKDNIIL
> >>PLAFA_MAL13P1.23-a
> > MSFESFVLKDEKKASNKKYDYDEIDLNDDDDDIIDNKSFDKNNYSYNIKNRLFKHYKKVH
> >
> >
> > test.txt:
> > plafa:PLAFA_MAL13P1.23-b
> > plafa:PLAFA_MAL13P1.237a
> > plafa:PLAFA_MAL13P1.23-a
> >
> >
> > emboss.default:
> > DB plafa [
> > format: fasta
> > method: emblcd
> > directory: /home/liegmann/genomezoo/emboss/prob/test/db
> > type: P
> > ]
> >
> >
> >
> > Best regards,
> > Oliver Liegmann

--
Dipl.-Inf.
Oliver Liegmann
AG Rensing
Fakultät für Biologie
Albert-Ludwigs-Universität Freiburg
Hauptstraße 1
D-79104 Freiburg
+49 761 203-2521
oliver.liegmann at biologie.uni-freiburg.de
http://www.plantco.de/people/Oliver.html

MOSS 2011 - the annual meeting on bryophyte research
http://plantco.de/MOSS2011/

From mathog at caltech.edu Thu Feb 17 16:30:25 2011
From: mathog at caltech.edu (David Mathog)
Date: Thu, 17 Feb 2011 08:30:25 -0800
Subject: [EMBOSS] Transeq question, frame phases
Message-ID:

> Now let's do that with the reverse complement strand:
>
> $ transeq asis:TTTTTTTT -filter -frame 1
> >asis_1
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 2
> >asis_2
> FFX
> $ transeq asis:TTTTTTTT -filter -frame 3
> >asis_3
> FF

That is the problem. Let me try to explain more clearly what the issue is.

AAAAAAAA      Forward
TTTTTTTT      Reverse
                          abc      cba    <--- codons in diagram
^a^^b^^c^     phase 1   1 KKX    4 XFF    EXPECTED
x^a^^b^^c^    phase 2   2 KKX    5 XFF    EXPECTED
xx^a^^b^      phase 3   3 KK     6 FF     EXPECTED

^a^^b^^c^     phase 1   1 KKX    4 FF     OBSERVED
x^a^^b^^c^    phase 2   2 KKX    5 FFX    OBSERVED
xx^a^^b^      phase 3   3 KK     6 FFX    OBSERVED

Assume an extra codon L to the left of a.

                          abc      baL    <--- codons in diagram
^a^^b^^c^     phase 1   1 KKX    4 FF     EXPLAINED?
^^a^^b^^c^    phase 2   2 KKX    5 FFX    EXPLAINED?
L^^a^^b^      phase 3   3 KK     6 FFX    EXPLAINED?

That is, if the meaning of the + phases is to define the three codons
a,b,c as shown in the diagram, such that the forward translation is as
shown, then the reverse translation should be as shown above in
EXPECTED. That is, it is the translation of the exact same set of
codons done individually, but for the - strand, reverse complement each
codon first, and then reverse the resulting translated sequence. That
way the X, where it occurs, is attached to the same partial codon "c".

What I think is happening in transeq is that it is starting with the
first full codon in the frame on the given strand. In effect that shifts
the translated codons as shown in the "EXPLAINED?" section.
If partial codons were not translated then these would all be equivalent.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From biopython at maubp.freeserve.co.uk Thu Feb 17 17:03:15 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Feb 2011 17:03:15 +0000
Subject: [EMBOSS] Transeq question, frame phases
In-Reply-To:
References:
Message-ID:

On Thu, Feb 17, 2011 at 4:30 PM, David Mathog wrote:
>
>
>> Now let's do that with the reverse complement strand:
>>
>> $ transeq asis:TTTTTTTT -filter -frame 1
>> >asis_1
>> FFX

This is what I think that does (forward frames are easy):

Frame 1, so starts at first base:
Letters 123, codon TTT, gives F
Letters 456, codon TTT, gives F
Letters 78, partial codon TT-, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 2
>> >asis_2
>> FFX

Frame 2, so starts at second base:
Letter 1, just T, ignored
Letters 234, codon TTT, gives F
Letters 567, codon TTT, gives F
Letter 8, partial codon T--, gives X

>> $ transeq asis:TTTTTTTT -filter -frame 3
>> >asis_3
>> FF

Frame 3, so starts at third base:
Letters 12, bases TT, ignored
Letters 345, codon TTT, gives F
Letters 678, codon TTT, gives F

> That is the problem. Let me try to explain more clearly what the issue is.
>
> That is, if the meaning of the + phases is to define the three codons
> a,b,c as shown in the diagram, such that the forward translation is as
> shown, then the reverse translation should be as shown above in
> expected. That is, it is the translation of the exact same set of
> codons done individually, but for the - strand reverse complement the
> codon first, and then invert the resulting translated sequence. That
> way the X, where it occurs is attached to the same partial codon "c".

I couldn't understand your diagram - probably font spacing issues in
part. The EMBOSS tool is doing all six frames, maybe all you need to
work out is the mapping between its naming and yours.
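
As a rough illustration of that mapping, here is a minimal Python sketch. It is not EMBOSS code: the function names are made up for this example, the standard genetic code and uppercase ACGT input are assumed, and the reverse-frame rule is only inferred from the transeq outputs quoted in this thread. transeq_like() follows the observed transeq numbering, while codon_preserving_reverse() follows the "EXPECTED" convention, where the X stays attached to the same partial codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in the conventional TCAG codon order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string (other letters kept)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate_codon(codon):
    """Translate a full or partial codon; return X unless the residue is unambiguous."""
    # Pad a partial codon with N, then try every base at each unknown position.
    choices = [b if b in BASES else BASES for b in codon.ljust(3, "N")]
    residues = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return residues.pop() if len(residues) == 1 else "X"


def translate(seq):
    """Translate from the first base, including any trailing partial codon."""
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, len(seq), 3))


def transeq_like(seq, frame):
    """Frames 1..3 and -1..-3, mirroring the transeq outputs shown in this thread."""
    offset = abs(frame) - 1
    if frame > 0:
        return translate(seq[offset:])
    # Frame -k reuses the codon boundaries of frame +k: the bases left over at
    # the 3' end of the forward frame are skipped on the reverse strand.
    skipped = (len(seq) - offset) % 3
    return translate(reverse_complement(seq)[skipped:])


def codon_preserving_reverse(seq, phase):
    """The other convention: translate each forward codon on the - strand,
    emit X for the partial codon, then reverse the residue order."""
    codons = [seq[i:i + 3] for i in range(phase - 1, len(seq), 3)]
    residues = [translate_codon(reverse_complement(c)) if len(c) == 3 else "X"
                for c in codons]
    return "".join(reversed(residues))


for frame in (1, 2, 3, -1, -2, -3):
    print(frame, transeq_like("AAAAAAAA", frame))
```

For AAAAAAAA the two functions differ only in which partial codon ends up as X and where it sits: codon_preserving_reverse() gives XFF, XFF, FF for phases 1 to 3, whereas transeq_like() gives FF, FFX, FFX for frames -1 to -3.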
Note that it can make sense to translate a trailing partial codon,
e.g. TC... could be TCA, TCC, TCG or TCT which all code for S:

$ transeq asis:TCN -filter
>asis_1
S
$ transeq asis:TC -filter
>asis_1
S

Peter

From marvin.stodolsky at gmail.com Fri Feb 18 02:23:15 2011
From: marvin.stodolsky at gmail.com (Marvin Stodolsky)
Date: Thu, 17 Feb 2011 21:23:15 -0500
Subject: [EMBOSS] FW: Reducing a FASTA repository, new user
In-Reply-To: <825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
References: <4D5A4068.4000302@ebi.ac.uk> <825156C7-E9C0-47CE-9C32-C1EB71EE9002@ohsu.edu>
Message-ID:

Sorry, here is the attachment.
The whole cleanup process could be done with only sed calls I'm sure,
but that would be beyond my sed comfort level.

MarvS

On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller wrote:
> Hi Martin,
> I am interested in the solution. There was no attachment to the email I received. Would you mind sending it?
>
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
>
>
>
>
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>
>> All thanks for the suggestions. A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>>
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository. This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>>
>> Specifically, is the command using ENTRET or relatives , to accept a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>>
>> If not, could you recommend a software tool/suite for this type of job.
>>
>> MarvS
>>
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>
>>>> This is elementary I'm sure, but I've been unable to work out the
>>>> syntax from the documentation.
>>>> More minor issue.
>>>>
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like 234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet. Is there a Fix
>>>> for this?
>>>
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>>
>>> You can use a delimiter between items for infoseq using:
>>>
>>>  -nocolumn
>>>
>>> on the command line.
>>>
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>
>>>  -nocolumn -delimiter "\t"
>>>
>>> on the command line. That should then import nicely into a spreadsheet.
>>>
>>> Hope that helps
>>>
>>> Peter Rice
>>> EMBOSS Team
>>>
>>
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
>
>
-------------- next part --------------
With respect to using info in the FASTA description field, the intent
and partial solution can now be explained.

The top level intent is to avoid overlapping genes in a statistical
analysis being planned. The 3rd & 4th lines below are from an
"infoseq -nocolumns" whole genome retrieval.
They report an overlap, i.e., the DNA gyrase A is overlapped by seryl-tRNA:
seryl_begin=7294 - 7322=gyrase_end < 0

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]

In a few microbes I've checked, about a quarter of the genes have some
putative overlap. These could contaminate the proteins/codon_usage
statistical analysis being planned. Thus I wished for an en masse way
of recognizing the overlapping genes.

A non-elegant fix has been worked out. Pulling the dataset into a
spreadsheet, spaces in the description field were next replaced with >< :

DnaJ><1828..2760(+)><[Mycoplasma><2845..4797(+)><[Mycoplasma><4812..7322(+)><[Mycoplasma><7294..8547(+)><[Mycoplasma><[Mycoplasma><[ is replaced by "to be field separator" |[

DNA><686..1828(+)|[Mycoplasma><1828..2760(+)|[Mycoplasma><2845..4797(+)|[Mycoplasma><4812..7322(+)|[Mycoplasma><7294..8547(+)|[Mycoplasma>< in the terminal common [Mycoplasma> Myc637000176m2.csv
resulting in :

DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

internals are next mostly deleted with:
sed -e 's/<.*>//g' Myc637000176m2.csv > Myc637000176m3.csv
resulting in:

DNA><686..1828(+)|
DnaJ><1828..2760(+)|
DNA><2845..4797(+)|
DNA><4812..7322(+)|
seryl-tRNA><7294..8547(+)|

The single remaining >< is replaced with potential separator |
sed -e 's/> Myc637000176m4.csv
resulting in:

DNA|686..1828(+)|
DnaJ|1828..2760(+)|
DNA|2845..4797(+)|
DNA|4812..7322(+)|
seryl-tRNA|7294..8547(+)|

BASICALLY, the clever work is now done, and the rest is more routine
manipulation.
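
For comparison, the same extraction can be sketched in Python without the intermediate sed passes. This is an illustration only, assuming every description carries a begin..end(strand) field in the form shown above and that the records are already sorted by start position. The names parse_description and flag_overlaps are hypothetical, and the overlap rule (next begin minus this end below zero, adjacent genes only) mirrors the seryl_begin - gyrase_end < 0 test.

```python
import re

# Gene name, then begin..end(strand), as in the description lines above.
COORD = re.compile(
    r"^(?P<name>.*?)\s+(?P<start>\d+)\.\.(?P<end>\d+)\((?P<strand>[+-])\)"
)

SAMPLE = [
    "DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]",
    "DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]",
    "DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]",
    "seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]",
    "thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]",
]


def parse_description(line):
    """Return (name, start, end, strand), or None if no coordinates are found."""
    match = COORD.search(line)
    if match is None:
        return None
    return (match["name"], int(match["start"]), int(match["end"]), match["strand"])


def flag_overlaps(genes):
    """Flag both members of every adjacent pair where next start < this end."""
    flags = [False] * len(genes)
    for i in range(len(genes) - 1):
        if genes[i + 1][1] - genes[i][2] < 0:
            flags[i] = flags[i + 1] = True
    return flags


genes = [parse_description(line) for line in SAMPLE]
for gene, overlapping in zip(genes, flag_overlaps(genes)):
    print(int(overlapping), *gene, sep="\t")
```

The first output column is a fixed 0/1 value per gene, so it can be sorted on after import into a spreadsheet.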
A cleanup was done with:
sed -e 's/)|//g' Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g' Myc637000176m5.csv > Myc637000176m6.csv
together changing the (+)| to |+ , that is, a separated field.

The replacement of the residual .. with potential separator | was
easiest done as a within-spreadsheet operation in its own field,
because of too many residual "." in the whole file.

After routine manipulations within the spreadsheet, a view of the
overlap detection section is:

F       G       H               I                   J         fields
Start   End     nextBegin-End   OR((H2<0),(H1<0))   Stable 0/1 Value, for SORTING on
686     1828    0               FALSE               0
1828    2760    85              FALSE               0
2845    4797    15              TRUE                0
4812    7322    -28             TRUE                1
7294    8547    4               TRUE                1
8551    9183    -27             TRUE                1
9156    9920    3               FALSE               0
9923    11251   0               FALSE               0

The overlapping genes keep the stable value 1 during sorting, whereas
the FALSE/TRUE values in field I are not stable during sorting.

From egorleg at gmail.com Wed Feb 23 00:56:40 2011
From: egorleg at gmail.com (Kevin Egan)
Date: Wed, 23 Feb 2011 00:56:40 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To:
References:
Message-ID:

Hi

I was wondering is there anywhere I could find the source code for
dot-matcher?

From uludag at ebi.ac.uk Wed Feb 23 09:50:28 2011
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Wed, 23 Feb 2011 09:50:28 +0000
Subject: [EMBOSS] Dot-matcher
In-Reply-To:
References:
Message-ID: <1298454628.8626.3.camel@emboss1.ebi.ac.uk>

Hi Kevin,

> I was wondering is there anywhere I could find the source code for
> dot-matcher?

EMBOSS release tarballs include source files for EMBOSS applications and
for EMBOSS libraries.

ftp://emboss.open-bio.org/pub/EMBOSS/

Regards,
Mahmut