From jboddu at uiuc.edu Tue Jun 10 10:50:12 2008
From: jboddu at uiuc.edu (Jay)
Date: Tue, 10 Jun 2008 09:50:12 -0500
Subject: [EMBOSS] sequence retrieval
Message-ID: <000c01c8cb09$4969bdf0$dc3d39d0$@edu>

Hi:

I am brand new to EMBOSS and bioinformatics.
I have a large file with sequences in fasta format. They have IDs.

Is there any EMBOSS way to retrieve sequences by inputting a text file with
a short listed IDs?

Thanks
Jay

From pmr at ebi.ac.uk Tue Jun 10 12:11:50 2008
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 10 Jun 2008 17:11:50 +0100
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <000c01c8cb09$4969bdf0$dc3d39d0$@edu>
References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu>
Message-ID: <484EA7C6.2050904@ebi.ac.uk>

Jay wrote:
> I have a large file with sequences in fasta format. They have IDs.
>
> Is there any EMBOSS way to retrieve sequences by inputting a text file with
> a short listed IDs?

With EMBOSS you can refer to sequences in the file:

filename:id

You can also put a list of these into a file, and use that with
@listfilename

But this can be slow - it will read the file for each ID. You can also
index the file with dbxfasta (or dbifasta) as a private database then
define a database in your .embossrc file and use the dbname:id syntax
(again you can use a list file, but it will be much faster)

Hope this helps. If you need more help setting up please ask again!

regards,

Peter

From rls at ebi.ac.uk Tue Jun 10 13:09:42 2008
From: rls at ebi.ac.uk (Rodrigo Lopez)
Date: Tue, 10 Jun 2008 18:09:42 +0100
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <484EA7C6.2050904@ebi.ac.uk>
References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk>
Message-ID: <484EB556.4050307@ebi.ac.uk>

Alternatively, look into dbfetch and wsdbfetch Web Services:

http://www.ebi.ac.uk/dbfetch
http://www.ebi.ac.uk/Tools/webservices

All the EMBOSS applications are available under WSDBFetch/SOAPLAB.

R:)

Peter Rice wrote:
> Jay wrote:
>> I have a large file with sequences in fasta format. They have IDs.
>>
>> Is there any EMBOSS way to retrieve sequences by inputting a text file
>> with
>> a short listed IDs?
>
> With EMBOSS you can refer to sequences in the file:
>
> filename:id
>
> You can also put a list of these into a file, and use that with
> @listfilename
>
> But this can be slow - it will read the file for each ID. You can also
> index the file with dbxfasta (or dbifasta) as a private database then
> define a database in your .embossrc file and use the dbname:id syntax
> (again you can use a list file, but it will be much faster)
>
> Hope this helps. If you need more help setting up please ask again!
>
> regards,
>
> Peter
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss

From db60 at st-andrews.ac.uk Tue Jun 10 14:49:51 2008
From: db60 at st-andrews.ac.uk (Daniel Barker)
Date: Tue, 10 Jun 2008 19:49:51 +0100
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <484EB556.4050307@ebi.ac.uk>
References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> <484EB556.4050307@ebi.ac.uk>
Message-ID: <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk>

Dear Jay,

Are you simply trying to extract specific sequences from a Fasta-format
file? The EMBOSS program to do it is seqret, or maybe seqretsplit:

http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html

http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html

As Peter Rice suggests, you can do stuff to speed the access up, but
it'll work without that.
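
[Editorial sketch, not part of the original thread] The list-file route described above can be prepared with a few lines of Python. This is a minimal sketch under stated assumptions: build_list_file is a hypothetical helper (not an EMBOSS tool), and master.fasta and id_list.txt are placeholder names. It only writes the filename:id lines; EMBOSS itself still does the retrieval.

```python
from pathlib import Path


def build_list_file(fasta_name, ids, list_path):
    """Write one 'filename:id' line per ID and return the lines written."""
    # Skip blank entries and strip stray whitespace/newlines from each ID.
    lines = [f"{fasta_name}:{seq_id.strip()}" for seq_id in ids if seq_id.strip()]
    Path(list_path).write_text("".join(line + "\n" for line in lines))
    return lines


# Hypothetical usage, with the short-listed IDs stored one per line in ids.txt:
#   ids = Path("ids.txt").read_text().splitlines()
#   build_list_file("master.fasta", ids, "id_list.txt")
```

The resulting file would then be handed to seqret as @id_list.txt. As Peter notes, this re-reads the fasta file for each ID, so it suits short lists better than long ones.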

Best regards,

Daniel

--
Daniel Barker
http://bio.st-andrews.ac.uk/staff/db60.htm
The University of St Andrews is a charity registered in Scotland :
No SC013532

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

From jboddu at uiuc.edu Tue Jun 10 16:15:35 2008
From: jboddu at uiuc.edu (Jay)
Date: Tue, 10 Jun 2008 15:15:35 -0500
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk>
References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> <484EB556.4050307@ebi.ac.uk> <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk>
Message-ID: <002401c8cb36$be5f5390$3b1dfab0$@edu>

Daniel:
I tried seqret in different ways.
My problem is that EMBOSS is not recognizing my master sequence file (which
is in fasta form) as my private database, even after I did the indexing
using dbifasta.
When seqret asks me to input sequence(s), I am not able to figure out
what exactly it accepts.
I tried dbname:ID and dbname:@listfile.
I also tried a crude way: copying my master file and listfile into the
"embl" folder in the EMBOSSwin folder and trying the same syntax (embl:ID,
embl:@listfile etc.).
These did not work.
I am assuming that my master file is not being recognized as a private DB.
I wanted to define my database in the .embossrc file. I could not figure
this out either.
Jay

-----Original Message-----
From: Daniel Barker [mailto:db60 at st-andrews.ac.uk]
Sent: Tuesday, June 10, 2008 1:50 PM
To: rls at ebi.ac.uk
Cc: Peter Rice; Jay; emboss at lists.open-bio.org
Subject: Re: [EMBOSS] sequence retrieval

Dear Jay,

Are you simply trying to extract specific sequences from a Fasta-format
file?
The EMBOSS program to do it is seqret, or maybe seqretsplit:

http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html

http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html

As Peter Rice suggests, you can do stuff to speed the access up, but
it'll work without that.

Best regards,

Daniel

--
Daniel Barker
http://bio.st-andrews.ac.uk/staff/db60.htm
The University of St Andrews is a charity registered in Scotland :
No SC013532

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

From sean.maceach at gmail.com Tue Jun 10 17:00:33 2008
From: sean.maceach at gmail.com (Sean MacEachern)
Date: Tue, 10 Jun 2008 17:00:33 -0400
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <002401c8cb36$be5f5390$3b1dfab0$@edu>
Message-ID:

Hi Jay,

Just wondering if you have considered the tools from NCBI. If you were to
download the blast bundle (I think blast-2.2.17 is the most current release),
you can use formatdb to create a blastable database of your fasta seqs that
you can use for blasting using one of the blast programs or retrieving using
fastacmd.

I'm not sure what emboss application you are attempting to use but you could
probably use a for loop to automate some procedure

Eg.

for i in `cat seqIDs.txt`; do fastacmd -d blastdb -s "$i" > seq.fsa && primer3 -input seq.fsa -output "${i}_out.primers"; done

Depending on what you want to do something like that might work for you...

Cheers,
Sean

On 6/10/08 4:15 PM, "Jay" wrote:

> Daniel:
> I tried seqret in different ways.
> My problem is EMBOSS is not recognizing my master sequence file (which is in
> fasta form) as my private database. Even after I did the indexing using
> dbifasta.
> When seqret is asking me to input sequence(s), I am not able to figure out
> what exactly it accepts.
> I tried dbname:ID, dbname:@listfile.
> I also tried a crude way of copy pasting my master file and listfile in
> "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID,
> embl:@listfile etc.
> These did not work.
> I am assuming that my master file is not being recognized as a private DB.
> I wanted to define my database in .embossrc file. I could not figure this
> out either.
> Jay
>
> -----Original Message-----
> From: Daniel Barker [mailto:db60 at st-andrews.ac.uk]
> Sent: Tuesday, June 10, 2008 1:50 PM
> To: rls at ebi.ac.uk
> Cc: Peter Rice; Jay; emboss at lists.open-bio.org
> Subject: Re: [EMBOSS] sequence retrieval
>
> Dear Jay,
>
> Are you simply trying to extract specific sequences from a Fasta-format
> file? The EMBOSS program to do it is seqret, or maybe seqretsplit:
>
> http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html
>
> http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html
>
> As Peter Rice suggests, you can do stuff to speed the access up, but
> it'll work without that.
>
> Best regards,
>
> Daniel

From ztu at msi.umn.edu Tue Jun 10 17:54:06 2008
From: ztu at msi.umn.edu (Zheng Jin Tu)
Date: Tue, 10 Jun 2008 16:54:06 -0500 (CDT)
Subject: [EMBOSS] sequence retrieval
In-Reply-To:
References:
Message-ID:

This is a very common requirement in the biological user community,
especially among microarray users. They have a list of IDs (Affy IDs or
accession numbers) from microarray data analysis, and they want the
sequences from a fasta file such as the Affymetrix Library xxx.sif file.

In order to use EMBOSS, the EMBOSS admin needs to index the database first.
NCBI fastacmd is another option for getting sequences fast, especially for
a large fasta sequence file such as nt or nr.

A perl script will be useful for batch sequence retrieval. It will read the
input file with the list of IDs line by line, then do one of:

1): fastacmd -d database -s ID >> outsequence   # ncbi formatdb case
2): seqret .....                                 # EMBOSS case
3): Or just loop over the sequence file with a found/not-found flag set by
    matching the ID against each fasta heading ">id ...", and output the
    sequence while the flag is on. This is fine if the sequence file is
    relatively small, as it usually is in the microarray case.

Thanks,

TU
--------------------------------------------------

On Tue, 10 Jun 2008, Sean MacEachern wrote:

> Hi Jay,
>
> Just wondering if you have considered the tools from NCBI. If you were to
> dload the blast bundle, I think blast-2.2.17 is the most current release,
> you can use formatdb to create a blastable database of your fasta seqs that
> you can use for blasting using one of the blast programs or retrieving using
> fastacmd.
>
> I'm not sure what emboss application you are attempting to use but you could
> probably use a for loop to automate some procedure
>
> Eg.
>
> For i in `cat seqIDs.txt`; do fastacmd -d blastdb -s $i > seq.fsa | primer3
> -input seq.fsa -output $i_out.primers
>
> Depending on what you want to do something like that might work for you...
>
> Cheers,
> Sean
>
>
> On 6/10/08 4:15 PM, "Jay" wrote:
>
> > Daniel:
> > I tried seqret in different ways.
> > My problem is EMBOSS is not recognizing my master sequence file (which is in
> > fasta form) as my private database. Even after I did the indexing using
> > dbifasta.
> > When seqret is asking me to input sequence(s), I am not able to figure out
> > what exactly it accepts.
> > I tried dbname:ID, dbname:@listfile.
> > I also tried a crude way of copy pasting my master file and listfile in
> > "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID,
> > embl:@listfile etc.
> > These did not work.
> > I am assuming that my master file is not being recognized as a private DB.
> > I wanted to define my database in .embossrc file. I could not figure this
> > out either.
> > Jay
> >
> > -----Original Message-----
> > From: Daniel Barker [mailto:db60 at st-andrews.ac.uk]
> > Sent: Tuesday, June 10, 2008 1:50 PM
> > To: rls at ebi.ac.uk
> > Cc: Peter Rice; Jay; emboss at lists.open-bio.org
> > Subject: Re: [EMBOSS] sequence retrieval
> >
> > Dear Jay,
> >
> > Are you simply trying to extract specific sequences from a Fasta-format
> > file? The EMBOSS program to do it is seqret, or maybe seqretsplit:
> >
> > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html
> >
> > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html
> >
> > As Peter Rice suggests, you can do stuff to speed the access up, but
> > it'll work without that.
> >
> > Best regards,
> >
> > Daniel
> >
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>

--
==========================================================================

From david.bauer at bayerhealthcare.com Wed Jun 11 01:45:57 2008
From: david.bauer at bayerhealthcare.com (david.bauer at bayerhealthcare.com)
Date: Wed, 11 Jun 2008 07:45:57 +0200
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <002401c8cb36$be5f5390$3b1dfab0$@edu>
Message-ID:

Hi,

the database section of the adminguide
http://emboss.sourceforge.net/docs/adminguide/node37.html
describes all the emboss database indexing methods.

There is also a specific chapter on fasta files
http://emboss.sourceforge.net/docs/adminguide/node56.html
which describes the different forms of fasta files.

It is important to specify the correct type corresponding to the structure
of the sequence header line. And also use full path names for the
"Database directory" because relative path names like "." can cause
problems on some systems.

If you still get trouble, send me the section you have in .embossrc, so I
can have a look at it.

Hope this helps,
Cheers,
David.
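
[Editorial sketch, not part of the original thread] If the index still cannot be set up, the plain scan that Zheng Jin Tu lists as option 3 earlier in the thread needs no database definition at all. A hedged sketch in Python follows: extract_by_id is a hypothetical helper, and it assumes the ID is the first whitespace-delimited token after ">" on each header line, which may not hold for every fasta header style described in the adminguide chapter linked above.

```python
def extract_by_id(fasta_lines, wanted_ids):
    """Yield every line of each fasta record whose header ID is wanted."""
    wanted = set(wanted_ids)
    keep = False  # the found/not-found flag from option 3
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Hypothetical usage with placeholder file names:
#   with open("ids.txt") as ids, open("master.fasta") as fasta, \
#           open("selected.fasta", "w") as out:
#       wanted = [line.strip() for line in ids if line.strip()]
#       out.writelines(extract_by_id(fasta, wanted))
```

This reads the fasta file once regardless of how many IDs are requested, so it is a reasonable fallback for modest files; for something the size of nt or nr an index remains the better choice.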

emboss-bounces at lists.open-bio.org wrote on 10/06/2008 22:15:35:

> Daniel:
> I tried seqret in different ways.
> My problem is EMBOSS is not recognizing my master sequence file (which is in
> fasta form) as my private database. Even after I did the indexing using
> dbifasta.
> When seqret is asking me to input sequence(s), I am not able to figure out
> what exactly it accepts.
> I tried dbname:ID, dbname:@listfile.
> I also tried a crude way of copy pasting my master file and listfile in
> "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID,
> embl:@listfile etc.
> These did not work.
> I am assuming that my master file is not being recognized as a private DB.
> I wanted to define my database in .embossrc file. I could not figure this
> out either.
> Jay
>
> -----Original Message-----
> From: Daniel Barker [mailto:db60 at st-andrews.ac.uk]
> Sent: Tuesday, June 10, 2008 1:50 PM
> To: rls at ebi.ac.uk
> Cc: Peter Rice; Jay; emboss at lists.open-bio.org
> Subject: Re: [EMBOSS] sequence retrieval
>
> Dear Jay,
>
> Are you simply trying to extract specific sequences from a Fasta-format
> file? The EMBOSS program to do it is seqret, or maybe seqretsplit:
>
> http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html
>
> http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html
>
> As Peter Rice suggests, you can do stuff to speed the access up, but
> it'll work without that.
>
> Best regards,
>
> Daniel
>
> --
> Daniel Barker
> http://bio.st-andrews.ac.uk/staff/db60.htm
> The University of St Andrews is a charity registered in Scotland :
> No SC013532
>
>
> ------------------------------------------------------------------
> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss

From db60 at st-andrews.ac.uk Wed Jun 11 06:49:12 2008
From: db60 at st-andrews.ac.uk (Daniel Barker)
Date: Wed, 11 Jun 2008 11:49:12 +0100
Subject: [EMBOSS] sequence retrieval
In-Reply-To: <002401c8cb36$be5f5390$3b1dfab0$@edu>
References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> <484EB556.4050307@ebi.ac.uk> <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk> <002401c8cb36$be5f5390$3b1dfab0$@edu>
Message-ID: <484FADA8.5090503@st-andrews.ac.uk>

Dear Jay,

My simple idea is just something like this:

seqret @id_list.txt

where id_list.txt is something like this:

23214.O_sativa_Nipponbare.fasta:Q9FXT4
23214.O_sativa_Nipponbare.fasta:Q2R8Z5
23214.O_sativa_Nipponbare.fasta:Q10AZ4

(23214.O_sativa_Nipponbare.fasta is a Fasta-format file in the current
directory.)

This certainly works - however, it may not really match what you're after.

Best wishes,

Daniel

--
Daniel Barker
http://bio.st-andrews.ac.uk/staff/db60.htm
The University of St Andrews is a charity registered in Scotland :
No SC013532

From orbitus007 at gmail.com Wed Jun 11 12:44:32 2008
From: orbitus007 at gmail.com (Rudy Aramayo)
Date: Wed, 11 Jun 2008 09:44:32 -0700
Subject: [EMBOSS] Emboss Wrapper for Mac OS X
Message-ID: <29555DCF-21ED-48A0-99DE-F6862BAD400B@neo.tamu.edu>

Howdy!

My name is Rodolfo Aramayo. I have written an application that wraps
Emboss, as well as any Unix application, for the Mac. It will ONLY cover
the Apple side of the spectrum (Mac OS X Leopard and higher).

We distribute task assignments to a computer with the AppleScript language,
allowing us to manipulate all the beautiful functionality of the Emboss
package, including NCBI Blast, from scripts.

This application is a generic wrapper tool for all Unix applications. It
allows us to control Emboss and/or other bioinformatics Unix applications.

With this tool we have also incorporated the ability to communicate with an
XGrid (distributed computations): this is a way to send messages to every
computer on an "XGrid" network (using AppleScript scripts of course) so
that you can get a simple cluster of computers to perform a large task. For
example, we distribute a local Blast search of entire genomes amongst an
XGrid and collect each result into a single machine.

Apple also has a powerful Automator Workflow feature (a wrapper for
AppleScript); this allows users who do not have any AppleScript experience
to script the modular components of the application (like reading in data,
or blasting data) with graphical drag-and-drop modules.

In this manner we have written iBioCAD, to be presented at WWDC 2008, the
Apple Worldwide Developers Conference. Look for us soon.

The product is NOT ready and we are still developing; we will be completing
most of this project and I will be displaying a scientific poster regarding
the structure of the application.

Let's build a great wrapper to graphically display bioinformatics to the
world, together.

-Rodolfo Aramayo

From john.walshaw at bbsrc.ac.uk Wed Jun 11 13:51:00 2008
From: john.walshaw at bbsrc.ac.uk (john walshaw (JIC))
Date: Wed, 11 Jun 2008 18:51:00 +0100
Subject: [EMBOSS] problem with unauthenticated Jemboss server
Message-ID:

Hello,

I am trying to install an un-authenticated Jemboss server on Linux (RHEL4,
on an AMD64 platform). I've managed this before on other RedHat flavours,
and on Tru64.

Everything appears to be ok in terms of the Jemboss service being deployed,
which I can see on the Tomcat server via Axis. However, when I try to
connect with my Jemboss client, I immediately get the "Check Settings"
popup, even though the Public/Private server details appear correct. As
expected, at no stage does a login dialogue appear. However, if I click OK
on the Check Settings popup, then try to run an EMBOSS app, I get the
popup:

"Authentication failed/ The server wants a username and password ..."

Can anybody help me diagnose the cause? The logs produced by the vanilla
Tomcat installation aren't very helpful.

Details are:

EMBOSS 5.0.0
Tomcat 5.0.28
Axis 1.4
Sun Java 1.5.0.11.x86_64
kernel 2.6.9-42.ELsmp

The installation is on a node ('node7') of a cluster behind a firewall. I'm
running the client on the same host and another one behind the same
firewall.

When running configure, I specified --without-auth (and --with-thread=linux
and --enable-64).

When building Jemboss, I compiled the JembossServer and JembossFileServer
classes (not the ...Auth.. equivalents).

The relevant entries in the jemboss.properties file used by both server &
client are:

user.auth=false
jemboss.server=true
server.public=http://node7:8080/axis/services
server.private=http://node7:8080/axis/services
service.public=JembossServer
service.private=JembossServer

The above server details appear as expected in the Preferences -> Settings
-> Servers dialogue of the Jemboss client.

After starting Tomcat and deploying JembossServer, I can go to:

http://node7:8080/axis/services/JembossServer

using a browser on the same node or a different one on the cluster. I get
the expected page ("JembossServer Hi there, this is an AXIS service! .... "
etc).

http://node7:8080/axis/happyaxis.jsp

lists all the Needed Components, and all are present. All that is missing
is one optional component, the XML Security class.

http://node7:8080/axis/servlet/AxisServlet

shows that both JembossServer and EmbreoFile have been added - they and all
their methods are listed.

If I run the Jemboss client on the same host as the server, it's still the
same problem if I specify the servers as

http://localhost:8080/axis/services

Any help much appreciated,

regards,
John.

Dr John Walshaw
Department of Computational & Systems Biology
John Innes Centre
Colney
Norwich NR4 7UH
UK

From maoj at helix.nih.gov Fri Jun 13 16:27:36 2008
From: maoj at helix.nih.gov (Jean Mao)
Date: Fri, 13 Jun 2008 16:27:36 -0400
Subject: [EMBOSS] Question about seq fragments merge then align
Message-ID: <4852D838.4010406@helix.nih.gov>

Hi all,

I would like to know which program(s) I should use to do the following,
preferably in as few steps as possible:

- find the overlap regions of multiple sequence fragments
- merge them into one big sequence
- align to a known sequence

I found programs that only merge 2 sequences, not multiple sequences.

Thank you very much.

Jean Mao

From andrespinzon at gmail.com Tue Jun 17 15:47:17 2008
From: andrespinzon at gmail.com (Andres Pinzon)
Date: Tue, 17 Jun 2008 14:47:17 -0500
Subject: [EMBOSS] notseq and fasta definition headers
Message-ID: <8968fc7e0806171247o40d2f7a7gd64618d567c125fd@mail.gmail.com>

Hi,

I'm using notseq to obtain a subset of fasta seqs from a multiple fasta
file:

notseq -junkoutseq 1000-1.fasta -sequence 7135seqs.fasta -exclude @xaa.list.fasta -outseq leftSeqs.fast

The output is correct, but notseq changes the definition in the fasta
headers, so if the fasta header in "xaa.list.fasta" was:

lcl|29855|ORF26673_6

the corresponding fasta header in sequence in 1000-1.fasta is:

29855

Is there a way to tell "notseq" to keep the original fasta headers intact?

Thanks in advance,

--
Andrés Pinzón cPhD
http://bioinf.ibun.unal.edu.co/~apinzon/
Bioinformatics Center, Colombia EMBnet node
http://bioinf.ibun.unal.edu.co
Tel +57 3165000 ext 16961 Fax +571 3165415
Micology and Phytopathology Laboratory - Los Andes University.
http://bioinf.uniandes.edu.co
Tel +571 3394949 ext. 2768

From pmr at ebi.ac.uk Tue Jun 17 16:28:47 2008
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 17 Jun 2008 21:28:47 +0100
Subject: [EMBOSS] notseq and fasta definition headers
In-Reply-To: <8968fc7e0806171247o40d2f7a7gd64618d567c125fd@mail.gmail.com>
References: <8968fc7e0806171247o40d2f7a7gd64618d567c125fd@mail.gmail.com>
Message-ID: <48581E7F.40706@ebi.ac.uk>

Andres Pinzon wrote:
> The output is correct, but notseq changes the definition in the fasta
> headers, so if the fasta header in "xaa.list.fasta" was:
>
> lcl|29855|ORF26673_6
>
> the corresponding fasta header in sequence in 1000-1.fasta is:
>
> 29855
>
> Is there a way to tell "notseq" to keep the original fasta headers intact?

Yes.

FASTA format is not simple ... we have seen many ways to hide extra
information in the ID (EMBOSS recognizes NCBI id formats and parses out the
ID 29855) and also in the description (we try to recognize conventions used
by GCG and ACEDB).

But you can also specify "pearson" format, which reads the ID without
parsing. Just add to the commandline:

notseq -sf pearson

Now you have another problem. This will not work for notseq!!!

The exclude string in notseq is a pattern. In processing the pattern, some
pattern characters are removed:

whitespace
',' and ';'
'|'

So your exclude pattern cannot include any '|' characters.

As a workaround, you can exclude "*ORF26673_6" and the IDs will be
preserved.

For the next release we will allow '|' characters. When notseq was first
written there was a possibility to use regular expressions, but now we only
use simple text matching so the pipe characters are not a problem.

Hope that helps

Peter

From jcohn at pngg.org Wed Jun 25 13:51:28 2008
From: jcohn at pngg.org (Josh Cohn)
Date: Wed, 25 Jun 2008 13:51:28 -0400
Subject: [EMBOSS] einverted- file size limits?
Message-ID:

Hello,

I am attempting to use einverted on a relatively large set of sequences.
I've noticed that when I run just a few sequences, einverted seems to run
just fine.
However, when I use the same parameters on a large set of sequences, the
program quits before it has finished analyzing all of the data. Are there
known file size limits or sequence length limits for einverted? If so, how
can I run large sequences (>300kb) or large numbers of sequences (1000+)?

I'm running einverted from EMBOSS 5.0.0 on a Sun machine running Solaris 9
for SPARC.

Thanks,

Josh

From jison at ebi.ac.uk Thu Jun 26 03:24:40 2008
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 26 Jun 2008 08:24:40 +0100 (BST)
Subject: [EMBOSS] einverted- file size limits?
In-Reply-To:
References:
Message-ID: <36190.84.92.187.247.1214465080.squirrel@webmail.ebi.ac.uk>

Hi Josh

The short answer is you need more memory and a faster computer. Check there
are no system limits on memory usage (do an "unlimit" or some such).

EMBOSS has no arbitrary memory limits; it is just that einverted uses full
dynamic programming, which is necessarily very memory and CPU intensive,
especially for larger sequences.

You could try running palindrome, which does a similar thing and is faster
and less memory intensive.

Cheers

Jon

> Hello,
>
> I am attempting to use einverted on a relatively large set of
> sequences. I've noticed that when I run just a few sequences, einverted
> seems to run just fine. However, when I use the same parameters on a
> large set of sequences, the program quits before it has finished
> analyzing all of the data. Are there known file size limits or sequence
> length limits for einverted? If so, how can I run large sequences
> (>300kb) or large numbers of sequences (1000+)?
>
>
>
> I'm running einverted from EMBOSS 5.0.0 on a Sun machine running Solaris
> 9 for SPARC.
> > > > Thanks, > > > > Josh > > > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From jboddu at uiuc.edu Tue Jun 10 14:50:12 2008 From: jboddu at uiuc.edu (Jay) Date: Tue, 10 Jun 2008 09:50:12 -0500 Subject: [EMBOSS] sequence retrieval Message-ID: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> Hi: I am brand new to EMBOSS and bioinformatics. I have a large file with sequences in fasta format. They have IDs. Is there any EMBOSS way to retrieve sequences by inputting a text file with a short listed IDs? Thanks Jay From pmr at ebi.ac.uk Tue Jun 10 16:11:50 2008 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 10 Jun 2008 17:11:50 +0100 Subject: [EMBOSS] sequence retrieval In-Reply-To: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> Message-ID: <484EA7C6.2050904@ebi.ac.uk> Jay wrote: > I have a large file with sequences in fasta format. They have IDs. > > Is there any EMBOSS way to retrieve sequences by inputting a text file with > a short listed IDs? With EMBOSS you can refer to sequences in the file: filename:id You can also put a list of these into a file, and use that with @listfilename But this can be slow - it will read the file for each ID. You can also index the file with dbxfasta (or dbifasta) as a private database then define a database in your .embossrc file and use the dbname:id syntax (again you can use a list file, but it will be much faster) Hope this helps. If you need more help setting up please ask again! 
regards, Peter From rls at ebi.ac.uk Tue Jun 10 17:09:42 2008 From: rls at ebi.ac.uk (Rodrigo Lopez) Date: Tue, 10 Jun 2008 18:09:42 +0100 Subject: [EMBOSS] sequence retrieval In-Reply-To: <484EA7C6.2050904@ebi.ac.uk> References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> Message-ID: <484EB556.4050307@ebi.ac.uk> Alternatively, look into dbfetch and wsdbfetch Web Services: http://www.ebi.ac.uk/dbfetch http://www.ebi.ac.uk/Tools/webservices All the EMBOSS applications are available under WSDBFetch/SOAPLAB. R:) Peter Rice wrote: > Jay wrote: >> I have a large file with sequences in fasta format. They have IDs. >> >> Is there any EMBOSS way to retrieve sequences by inputting a text file >> with >> a short listed IDs? > > With EMBOSS you can refer to sequences in the file: > > filename:id > > You can also put a list of these into a file, and use that with > @listfilename > > But this can be slow - it will read the file for each ID. You can also > index the file with dbxfasta (or dbifasta) as a private database then > define a database in your .embossrc file and use the dbname:id syntax > (again you can use a list file, but it will be much faster) > > Hope this helps. If you need more help setting up please ask again! > > regards, > > Peter > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From db60 at st-andrews.ac.uk Tue Jun 10 18:49:51 2008 From: db60 at st-andrews.ac.uk (Daniel Barker) Date: Tue, 10 Jun 2008 19:49:51 +0100 Subject: [EMBOSS] sequence retrieval In-Reply-To: <484EB556.4050307@ebi.ac.uk> References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> <484EB556.4050307@ebi.ac.uk> Message-ID: <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk> Dear Jay, Are you simply trying to extract specific sequences from a Fasta-format file? 
The EMBOSS program to do it is seqret, or maybe seqretsplit: http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html As Peter Rice suggests, you can do stuff to speed the access up, but it'll work without that. Best regards, Daniel -- Daniel Barker http://bio.st-andrews.ac.uk/staff/db60.htm The University of St Andrews is a charity registered in Scotland : No SC013532 ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From jboddu at uiuc.edu Tue Jun 10 20:15:35 2008 From: jboddu at uiuc.edu (Jay) Date: Tue, 10 Jun 2008 15:15:35 -0500 Subject: [EMBOSS] sequence retrieval In-Reply-To: <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk> References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> <484EB556.4050307@ebi.ac.uk> <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk> Message-ID: <002401c8cb36$be5f5390$3b1dfab0$@edu> Daniel: I tried seqret in different ways. My problem is EMBOSS is not recognizing my master sequence file (which is in fasta form) as my private database. Even after I did the indexing using dbifasta. When seqret is asking me to input sequence(s), I am not able to figure out what exactly it accepts. I tried dbname:ID, dbname:@listfile. I also tried a crude way of copy pasting my master file and listfile in "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID, embl:@listfile etc. These did not work. I am assuming that my master file is not being recognized as a private DB. I wanted to define my database in .embossrc file. I could not figure this out either. 
Jay -----Original Message----- From: Daniel Barker [mailto:db60 at st-andrews.ac.uk] Sent: Tuesday, June 10, 2008 1:50 PM To: rls at ebi.ac.uk Cc: Peter Rice; Jay; emboss at lists.open-bio.org Subject: Re: [EMBOSS] sequence retrieval Dear Jay, Are you simply trying to extract specific sequences from a Fasta-format file? The EMBOSS program to do it is seqret, or maybe seqretsplit: http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html As Peter Rice suggests, you can do stuff to speed the access up, but it'll work without that. Best regards, Daniel -- Daniel Barker http://bio.st-andrews.ac.uk/staff/db60.htm The University of St Andrews is a charity registered in Scotland : No SC013532 ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From sean.maceach at gmail.com Tue Jun 10 21:00:33 2008 From: sean.maceach at gmail.com (Sean MacEachern) Date: Tue, 10 Jun 2008 17:00:33 -0400 Subject: [EMBOSS] sequence retrieval In-Reply-To: <002401c8cb36$be5f5390$3b1dfab0$@edu> Message-ID: Hi Jay, Just wondering if you have considered the tools from NCBI. If you were to download the blast bundle (I think blast-2.2.17 is the most current release), you can use formatdb to create a blastable database of your fasta seqs, which you can then search with one of the blast programs or retrieve from using fastacmd. I'm not sure what emboss application you are attempting to use, but you could probably use a for loop to automate some procedure. E.g. for i in $(cat seqIDs.txt); do fastacmd -d blastdb -s "$i" > seq.fsa; primer3 -input seq.fsa -output "${i}_out.primers"; done Depending on what you want to do, something like that might work for you... Cheers, Sean On 6/10/08 4:15 PM, "Jay" wrote: > Daniel: > I tried seqret in different ways.
> My problem is EMBOSS is not recognizing my master sequence file (which is in > fasta form) as my private database. Even after I did the indexing using > dbifasta. > When seqret is asking me to input sequence(s), I am not able to figure out > what exactly it accepts. > I tried dbname:ID, dbname:@listfile. > I also tried a crude way of copy pasting my master file and listfile in > "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID, > embl:@listfile etc. > These did not work. > I am assuming that my master file is not being recognized as a private DB. > I wanted to define my database in .embossrc file. I could not figure this > out either. > Jay > > -----Original Message----- > From: Daniel Barker [mailto:db60 at st-andrews.ac.uk] > Sent: Tuesday, June 10, 2008 1:50 PM > To: rls at ebi.ac.uk > Cc: Peter Rice; Jay; emboss at lists.open-bio.org > Subject: Re: [EMBOSS] sequence retrieval > > Dear Jay, > > Are you simply trying to extract specific sequences from a Fasta-format > file? The EMBOSS program to do it is seqret, or maybe seqretsplit: > > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html > > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html > > As Peter Rice suggests, you can do stuff to speed the access up, but > it'll work without that. > > Best regards, > > Daniel From ztu at msi.umn.edu Tue Jun 10 21:54:06 2008 From: ztu at msi.umn.edu (Zheng Jin Tu) Date: Tue, 10 Jun 2008 16:54:06 -0500 (CDT) Subject: [EMBOSS] sequence retrieval In-Reply-To: References: Message-ID: This is very popular requirement from biological user community especially microarray user community. They have a list of id (affyid or access number) from microarray data analysis. Then they want sequence from fasta file such as Affymetrix Library xxx.sif file. In order to use EMBOSS, emboss admin needs to index database first. 
NCBI fastacmd is another option for getting sequences fast, especially for a large FASTA sequence file such as nt or nr. A Perl script will be useful for batch sequence retrieval. It will read the input file with the list of IDs line by line, then do one of: 1) fastacmd -d database -s ID >> outsequence # NCBI formatdb case 2) seqret ..... # EMBOSS case 3) Or just loop over the sequence file with a found/not-found flag set by matching the ID against the FASTA heading ">id ...", and output the sequence while the flag is on. Option 3 suits relatively small sequence files, as in the microarray case. Thanks, TU -------------------------------------------------- On Tue, 10 Jun 2008, Sean MacEachern wrote: > Hi Jay, > > Just wondering if you have considered the tools from NCBI. If you were to > dload the blast bundle, I think blast-2.2.17 is the most current release, > you can use formatdb to create a blastable database of your fasta seqs that > you can use for blasting using one of the blast programs or retrieving using > fastacmd. > > I'm not sure what emboss application you are attempting to use but you could > probably use a for loop to automate some procedure > > Eg. > > For i in `cat seqIDs.txt`; do fastacmd -d blastdb -s $i > seq.fsa | primer3 > -input seq.fsa -output $i_out.primers > > Depending on what you want to do something like that might work for you... > > Cheers, > Sean > > > On 6/10/08 4:15 PM, "Jay" wrote: > > > Daniel: > > I tried seqret in different ways. > > My problem is EMBOSS is not recognizing my master sequence file (which is in > > fasta form) as my private database. Even after I did the indexing using > > dbifasta. > > When seqret is asking me to input sequence(s), I am not able to figure out > > what exactly it accepts. > > I tried dbname:ID, dbname:@listfile. > > I also tried a crude way of copy pasting my master file and listfile in > > "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID, > > embl:@listfile etc. > > These did not work.
> > I am assuming that my master file is not being recognized as a private DB. > > I wanted to define my database in .embossrc file. I could not figure this > > out either. > > Jay > > > > -----Original Message----- > > From: Daniel Barker [mailto:db60 at st-andrews.ac.uk] > > Sent: Tuesday, June 10, 2008 1:50 PM > > To: rls at ebi.ac.uk > > Cc: Peter Rice; Jay; emboss at lists.open-bio.org > > Subject: Re: [EMBOSS] sequence retrieval > > > > Dear Jay, > > > > Are you simply trying to extract specific sequences from a Fasta-format > > file? The EMBOSS program to do it is seqret, or maybe seqretsplit: > > > > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html > > > > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html > > > > As Peter Rice suggests, you can do stuff to speed the access up, but > > it'll work without that. > > > > Best regards, > > > > Daniel > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > -- ========================================================================== From david.bauer at bayerhealthcare.com Wed Jun 11 05:45:57 2008 From: david.bauer at bayerhealthcare.com (david.bauer at bayerhealthcare.com) Date: Wed, 11 Jun 2008 07:45:57 +0200 Subject: [EMBOSS] sequence retrieval In-Reply-To: <002401c8cb36$be5f5390$3b1dfab0$@edu> Message-ID: Hi, the database section of the adminguide http://emboss.sourceforge.net/docs/adminguide/node37.html describes all the emboss database indexing methods. There is also a specific chapter on fasta files http://emboss.sourceforge.net/docs/adminguide/node56.html which describes the different forms of fasta files. It is important to specify the correct type corresponding to the structure of the sequence header line. And also use full path names for the "Database directory" because relative path names like "." can cause problems on some systems. 
If you still have trouble, send me the section you have in .embossrc so I can have a look at it. Hope this helps, Cheers, David. emboss-bounces at lists.open-bio.org wrote on 10/06/2008 22:15:35: > Daniel: > I tried seqret in different ways. > My problem is EMBOSS is not recognizing my master sequence file (which is in > fasta form) as my private database. Even after I did the indexing using > dbifasta. > When seqret is asking me to input sequence(s), I am not able to figure out > what exactly it accepts. > I tried dbname:ID, dbname:@listfile. > I also tried a crude way of copy pasting my master file and listfile in > "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID, > embl:@listfile etc. > These did not work. > I am assuming that my master file is not being recognized as a private DB. > I wanted to define my database in .embossrc file. I could not figure this > out either. > Jay > > -----Original Message----- > From: Daniel Barker [mailto:db60 at st-andrews.ac.uk] > Sent: Tuesday, June 10, 2008 1:50 PM > To: rls at ebi.ac.uk > Cc: Peter Rice; Jay; emboss at lists.open-bio.org > Subject: Re: [EMBOSS] sequence retrieval > > Dear Jay, > > Are you simply trying to extract specific sequences from a Fasta-format > file? The EMBOSS program to do it is seqret, or maybe seqretsplit: > > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html > > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html > > As Peter Rice suggests, you can do stuff to speed the access up, but > it'll work without that.
> > Best regards, > > Daniel > > -- > Daniel Barker > http://bio.st-andrews.ac.uk/staff/db60.htm > The University of St Andrews is a charity registered in Scotland : > No SC013532 > > > ------------------------------------------------------------------ > University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From db60 at st-andrews.ac.uk Wed Jun 11 10:49:12 2008 From: db60 at st-andrews.ac.uk (Daniel Barker) Date: Wed, 11 Jun 2008 11:49:12 +0100 Subject: [EMBOSS] sequence retrieval In-Reply-To: <002401c8cb36$be5f5390$3b1dfab0$@edu> References: <000c01c8cb09$4969bdf0$dc3d39d0$@edu> <484EA7C6.2050904@ebi.ac.uk> <484EB556.4050307@ebi.ac.uk> <1213123791.484ecccf46b78@webmail.st-andrews.ac.uk> <002401c8cb36$be5f5390$3b1dfab0$@edu> Message-ID: <484FADA8.5090503@st-andrews.ac.uk> Dear Jay, My simple idea is just something like this: seqret @id_list.txt where id_list.txt is something like this: 23214.O_sativa_Nipponbare.fasta:Q9FXT4 23214.O_sativa_Nipponbare.fasta:Q2R8Z5 23214.O_sativa_Nipponbare.fasta:Q10AZ4 (23214.O_sativa_Nipponbare.fasta is a Fasta-format file in the current directory.) This certainly works - however, it may not really match what you're after. Best wishes, Daniel -- Daniel Barker http://bio.st-andrews.ac.uk/staff/db60.htm The University of St Andrews is a charity registered in Scotland : No SC013532 From orbitus007 at gmail.com Wed Jun 11 16:44:32 2008 From: orbitus007 at gmail.com (Rudy Aramayo) Date: Wed, 11 Jun 2008 09:44:32 -0700 Subject: [EMBOSS] Emboss Wrapper for Mac OS X Message-ID: <29555DCF-21ED-48A0-99DE-F6862BAD400B@neo.tamu.edu> Howdy! My name is Rodolfo Aramayo, I have written an application that wraps Emboss as well as any Unix application for the Mac. 
It will ONLY cover the Apple side of the spectrum (Mac OSX Leopard and higher). We distribute task assignments to a computer with the AppleScript language, allowing us to manipulate all the beautiful functionality of the Emboss package, including NCBI Blast, from scripts. This application is a generic wrapper tool for all Unix applications. It allows us to control Emboss and/or other Bioinformatics Unix applications. With this tool we have also incorporated the ability to communicate with an XGrid (distributed computations). This is a way to send messages to every computer on an "XGrid" network (using AppleScript scripts of course) so that you can get a simple cluster of computers to perform a large task. For example, we distribute a local Blast search of entire genomes amongst an XGrid and collect each result into a single machine. Apple also has a powerful Automator Workflow feature (a wrapper for AppleScript). This allows users who do not have any AppleScript experience to script with the modular components of the application (like reading in data, or blasting data) using graphical drag and drop modules. In this manner we have written iBioCAD to be presented at WWDC 2008, that is the Apple World Wide Developers Conference. Look for us soon. The product is NOT ready and we are still developing. We will be completing most of this project and I will be displaying a scientific poster regarding the structure of the application. Let's build a great wrapper to graphically display bioinformatics to the world, together. -Rodolfo Aramayo From john.walshaw at bbsrc.ac.uk Wed Jun 11 17:51:00 2008 From: john.walshaw at bbsrc.ac.uk (john walshaw (JIC)) Date: Wed, 11 Jun 2008 18:51:00 +0100 Subject: [EMBOSS] problem with unauthenticated Jemboss server Message-ID: Hello, I am trying to install an un-authenticated Jemboss server on Linux (RHEL4, on an AMD64 platform). I've managed this before on other RedHat flavours, and on Tru64.
Everything appears to be ok in terms of the Jemboss service being deployed, which I can see on the Tomcat server via Axis. However, when I try and connect with my Jemboss client, I immediately get the "Check Settings" popup, even though the Public/Private server details appear correct. As expected, at no stage does a login dialogue appear. However, if I click OK on the Check Settings popup, then try and run an EMBOSS app, I get the popup: "Authentication failed/ The server wants a username and password ..." Can anybody help me diagnose the cause? The logs produced by the vanilla Tomcat installation aren't very helpful. Details are: EMBOSS 5.0.0 Tomcat 5.0.28 Axis 1.4 Sun Java 1.5.0.11.x86_64 kernel 2.6.9-42.ELsmp The installation is on a node ('node7') of a cluster behind a firewall. I'm running the client on the same host and another one behind the same firewall. When running configure, I specified --without-auth (and --with-thread=linux and --enable-64). When building Jemboss, I compiled the JembossServer and JembossFileServer classes (not the ...Auth.. equivalents). The relevant entries in the jemboss.properties file used by both server & client are: user.auth=false jemboss.server=true server.public=http://node7:8080/axis/services server.private=http://node7:8080/axis/services service.public=JembossServer service.private=JembossServer The above server details appear as expected in the Preferences -> Settings -> Servers dialogue of the Jemboss client. After starting Tomcat and deploying JembossServer, I can go to: http://node7:8080/axis/services/JembossServer using a browser on the same node or a different one on the cluster. I get the expected page ("JembossServer Hi there, this is an AXIS service! .... " etc). http://node7:8080/axis/happyaxis.jsp lists all the Needed Components, and all are present. All that is missing is one optional component, the XML Security class. 
http://node7:8080/axis/servlet/AxisServlet shows that both JembossServer and EmbreoFile have been added - they and all their methods are listed. If I run the Jemboss client on the same host as the server, it's still the same problem if I specify the servers as http://localhost:8080/axis/services Any help much appreciated, regards, John. Dr John Walshaw Department of Computational & Systems Biology John Innes Centre Colney Norwich NR4 7UH UK From maoj at helix.nih.gov Fri Jun 13 20:27:36 2008 From: maoj at helix.nih.gov (Jean Mao) Date: Fri, 13 Jun 2008 16:27:36 -0400 Subject: [EMBOSS] Question about seq fragments merge then align Message-ID: <4852D838.4010406@helix.nih.gov> Hi all, I would like to know which program(s) I should use to do the following, preferably in as few steps as possible: - find the overlap regions of multiple sequence fragments - merge them into one big sequence - align to a known sequence I found programs that only merge 2 sequences, not multiple sequences. Thank you very much. Jean Mao From andrespinzon at gmail.com Tue Jun 17 19:47:17 2008 From: andrespinzon at gmail.com (Andres Pinzon) Date: Tue, 17 Jun 2008 14:47:17 -0500 Subject: [EMBOSS] notseq and fasta definition headers Message-ID: <8968fc7e0806171247o40d2f7a7gd64618d567c125fd@mail.gmail.com> Hi, I'm using notseq to obtain a subset of fasta seqs from a multiple fasta file: notseq -junkoutseq 1000-1.fasta -sequence 7135seqs.fasta -exclude @xaa.list.fasta -outseq leftSeqs.fast The output is correct, but notseq changes the definition in the fasta headers, so if the fasta header in "xaa.list.fasta" was: lcl|29855|ORF26673_6 the corresponding fasta header in sequence in 1000-1.fasta is: 29855 Is there a way to tell "notseq" to keep the original fasta headers intact?
Thanks in advance, -- Andrés Pinzón cPhD http://bioinf.ibun.unal.edu.co/~apinzon/ Bioinformatics Center, Colombia EMBnet node http://bioinf.ibun.unal.edu.co Tel +57 3165000 ext 16961 Fax +571 3165415 Micology and Phytopathology Laboratory - Los Andes University. http://bioinf.uniandes.edu.co Tel +571 3394949 ext.
2768 From pmr at ebi.ac.uk Tue Jun 17 20:28:47 2008 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 17 Jun 2008 21:28:47 +0100 Subject: [EMBOSS] notseq and fasta definition headers In-Reply-To: <8968fc7e0806171247o40d2f7a7gd64618d567c125fd@mail.gmail.com> References: <8968fc7e0806171247o40d2f7a7gd64618d567c125fd@mail.gmail.com> Message-ID: <48581E7F.40706@ebi.ac.uk> Andres Pinzon wrote: > The output is correct, but notseq changes the definition in the fasta > headers, so if the fasta header in "xaa.list.fasta" was: > > lcl|29855|ORF26673_6 > > the corresponding fasta header in sequence in 1000-1.fasta is: > > 29855 > > Is there a way to tell "notseq" to keep the original fasta headers intact? Yes. FASTA format is not simple ... we have seen many ways to hide extra information in the ID (EMBOSS recognizes NCBI id formats and parses out the ID 29855) and also in the description (we try to recognize conventions used by GCG and ACEDB). But you can also specify "pearson" format, which reads the ID without parsing. Just add to the commandline: notseq -sf pearson Now you have another problem. This will not work for notseq!!! The exclude string in notseq is a pattern. In processing the pattern, some pattern characters are removed: whitespace, ',', ';' and '|'. So your exclude pattern cannot include any '|' characters. As a workaround, you can exclude "*ORF26673_6" and the IDs will be preserved. For the next release we will allow '|' characters. When notseq was first written there was a possibility to use regular expressions, but now we only use simple text matching so the pipe characters are not a problem. Hope that helps Peter From jcohn at pngg.org Wed Jun 25 17:51:28 2008 From: jcohn at pngg.org (Josh Cohn) Date: Wed, 25 Jun 2008 13:51:28 -0400 Subject: [EMBOSS] einverted- file size limits? Message-ID: Hello, I am attempting to use einverted on a relatively large set of sequences. I've noticed that when I run just a few sequences, einverted seems to run just fine.
However, when I use the same parameters on a large set of sequences, the program quits before it has finished analyzing all of the data. Are there known file size limits or sequence length limits for einverted? If so, how can I run large sequences (>300kb) or large numbers of sequences (1000+)? I'm running einverted from EMBOSS 5.0.0 on a Sun machine running Solaris 9 for SPARC. Thanks, Josh From jison at ebi.ac.uk Thu Jun 26 07:24:40 2008 From: jison at ebi.ac.uk (Jon Ison) Date: Thu, 26 Jun 2008 08:24:40 +0100 (BST) Subject: [EMBOSS] einverted- file size limits? In-Reply-To: References: Message-ID: <36190.84.92.187.247.1214465080.squirrel@webmail.ebi.ac.uk> Hi Josh The short answer is you need more memory and a faster computer. Check there are no system limits on memory usage (do an "unlimit" or some such). EMBOSS has no arbitrary memory limits; it is just that einverted uses full dynamic programming, which is necessarily very memory and CPU intensive, especially for larger sequences. You could try running palindrome, which does a similar thing and is faster and less memory intensive. Cheers Jon > Hello, > > I am attempting to use einverted on a relatively large set of > sequences. I've noticed that when I run just a few sequences, einverted > seems to run just fine. However, when I use the same parameters on a > large set of sequences, the program quits before it has finished > analyzing all of the data. Are there known file size limits or sequence > length limits for einverted? If so, how can I run large sequences > (>300kb) or large numbers of sequences (1000+)? > > > > I'm running einverted from EMBOSS 5.0.0 on a Sun machine running Solaris > 9 for SPARC. > > > > Thanks, > > > > Josh > > > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss >
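Editorial sketch, not part of the thread: for the "large numbers of sequences (1000+)" case, one possible workaround is to split the multi-FASTA input into smaller batch files and run einverted once per batch, which at least shows which batch a run stops on. This is only an assumption about what may help; it does not reduce the memory that any single long sequence needs. The function names, the batch size and the output file naming below are hypothetical and do not come from EMBOSS.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines, header line first."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record:
                yield record
            record = [line]
        elif record and line:
            # Sequence line; blank lines and text before the first header are skipped.
            record.append(line)
    if record:
        yield record


def batch_records(records: Iterable[List[str]], batch_size: int) -> Iterator[List[List[str]]]:
    """Group records into lists of at most batch_size records."""
    batch: List[List[str]] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(fasta_path: str, prefix: str, batch_size: int = 100) -> List[str]:
    """Write prefix.0001.fasta, prefix.0002.fasta, ... and return the paths."""
    paths: List[str] = []
    with open(fasta_path) as handle:
        batches = batch_records(read_fasta_records(handle), batch_size)
        for number, batch in enumerate(batches, start=1):
            out_path = f"{prefix}.{number:04d}.fasta"
            with open(out_path, "w") as out:
                for record in batch:
                    out.write("\n".join(record) + "\n")
            paths.append(out_path)
    return paths
```

Calling write_batches("all_seqs.fasta", "batch", 100) (file names made up for illustration) would produce one file per 100 sequences; each file can then be given to einverted in a separate run with the same parameters as before.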