From georgios at biotek.uio.no  Sat Oct  1 05:37:50 2011
From: georgios at biotek.uio.no (Georgios Magklaras)
Date: Sat, 01 Oct 2011 11:37:50 +0200
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E86DF6E.6090505@biotek.uio.no>

On 09/30/2011 08:22 PM, Ed Siefker wrote:
> Is there a way to access NCBI Genbank remotely?
> My emboss.default contains the following:
>
> DB tgb [ type: N method: srswww format: genbank
> url: "http://cbr-rbc.nrc-cnrc.gc.ca/srs6bin/cgi-bin/wgetz"
> dbalias: genbankrelease
> fields: "sv des org key"
> comment: "Genbank IDs" ]
>
>
> However that server does not exist. I've looked on
> the NCBI website for alternatives, but all I can find
> is the ftp site. I've also read the EMBOSS admin guide.
> The examples there use infobiogen.fr, which is also
> closed.
>
> So what do people do for genbank access? I'd prefer
> to avoid setting up a local database myself if I can.
> Is there a list of genbank mirrors around somewhere?
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss

Hi Ed,

Yes, that SRS server does not exist anymore. The EBI SRS server is there
and updated regularly, but I am not sure if it offers Genbank.
It does offer a full version of EMBL (nucleotide database; the contents of
the release should mirror those of Genbank), so if you add the following
to your emboss.default file, you will be able to connect:

DB embl [ type: N method: srswww format: embl
url: "http://srs.ebi.ac.uk/cgi-bin/wgetz"
fields: "id sv des org key"
comment: "EMBL" ]

Best regards,
GM

--
--
George Magklaras PhD
RHCE no: 805008309135525
Senior Systems Engineer/IT Manager
Biotek Center, University of Oslo
EMBnet TMPC Chair

http://folk.uio.no/georgios

Tel: +47 22840535

From hrh at fmi.ch  Sat Oct  1 07:15:41 2011
From: hrh at fmi.ch (Hans-Rudolf Hotz)
Date: Sat, 01 Oct 2011 13:15:41 +0200
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E86DF6E.6090505@biotek.uio.no>
References: <4E8608C8.3090502@creighton.edu>
 <4E86DF6E.6090505@biotek.uio.no>
Message-ID: <4E86F65D.1090803@fmi.ch>

On 10/01/2011 11:37 AM, Georgios Magklaras wrote:
> On 09/30/2011 08:22 PM, Ed Siefker wrote:
>> Is there a way to access NCBI Genbank remotely?
>> My emboss.default contains the following:
>>
>> DB tgb [ type: N method: srswww format: genbank
>> url: "http://cbr-rbc.nrc-cnrc.gc.ca/srs6bin/cgi-bin/wgetz"
>> dbalias: genbankrelease
>> fields: "sv des org key"
>> comment: "Genbank IDs" ]
>>
>>
>> However that server does not exist. I've looked on
>> the NCBI website for alternatives, but all I can find
>> is the ftp site. I've also read the EMBOSS admin guide.
>> The examples there use infobiogen.fr, which is also
>> closed.
>>
>> So what do people do for genbank access? I'd prefer
>> to avoid setting up a local database myself if I can.
>> Is there a list of genbank mirrors around somewhere?
>>
> Hi Ed,
>
> Yes, that SRS server does not exist anymore. The EBI SRS server is there
> and updated regularly, but I am not sure if it offers Genbank.
> It does offer a full version of EMBL (nucleotide database; the contents
> of the release should mirror those of Genbank), so if you add the
> following to your emboss.default file, you will be able to connect:
>

Try the SRS server at the 'DKFZ', see:

http://www.dkfz.de/menu/cgi-bin/srs7.1.3.1/wgetz?-page+databanks

or check the list of Public SRS Installations, see:

http://www.biowisdom.com/download/srs-parser-and-software-downloads/public-srs-installations/

(although I am not sure whether this list is still maintained)

Regards, Hans

> DB embl [ type: N method: srswww format: embl
> url: "http://srs.ebi.ac.uk/cgi-bin/wgetz"
> fields: "id sv des org key"
> comment: "EMBL" ]
>
> Best regards,
> GM
>

From fermaral1981 at gmail.com  Tue Oct  4 09:38:22 2011
From: fermaral1981 at gmail.com (Fernando Martinez)
Date: Tue, 4 Oct 2011 15:38:22 +0200
Subject: [EMBOSS] uniq sequences on a list
Message-ID:

Hi, I am trying to retrieve sequences from a multi-fasta file where there are
identical sequences and I want to extract only the ones in my list. How can
I do that?
Example:

Multi.fasta file:

>seq1
atataga...
>seq2
ttatggttca..
[...]
>seq1
atataga...
[...]

and my list is:

Multi.fasta:seq1
Multi.fasta:seq2

When I run "seqret @list -out out.fasta" I retrieve:

>seq1
atataga...
>seq2
ttatggttca...
>seq1
atataga...

And I only want to take seq1 and seq2, not two times seq1!!

thanks

From pmr at ebi.ac.uk  Tue Oct  4 10:13:21 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 04 Oct 2011 15:13:21 +0100
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To:
References:
Message-ID: <4E8B1481.9060305@ebi.ac.uk>

On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> Hi, I am trying to retrieve sequences from a multi-fasta file where there are
> identical sequences and I want to extract only the ones in my list. How can
> I do that?
> Example:
>
> Multi.fasta file:
>
> >seq1
> atataga...
> >seq2
> ttatggttca..
> [...]
> >seq1
> atataga...
> [...]
> And I only want to take seq1 and seq2, not two times seq1!!

If you really must start from that file .... as usual with EMBOSS there
are several ways to do it

1. Index with dbifasta
----------------------

You can index with the older dbifasta program. This does not allow
duplicate IDs so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto

Then define a database in your .embossrc file:

DB multi [
  format: "fasta"
  method: "emblcd"
  type: "nucleotide"
  directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will
have the sequences you want.

2. Rewrite as single files in a new directory, then rewrite as one file

% mkdir multi
% seqret -ossingle -osdir multi Multi.fasta -auto
% ls multi
seq1.fasta seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(note: you do need the quotes around the wild card file name)

This will give you a file Single.fasta in the original directory with
only the last version of each id.

3. Write a new application
---------------------------

Another approach is to write your own new application. A copy of seqret
which keeps a table of ids and rejects any sequence with a known ID will
rewrite the file (in any format) with only the first occurrence of each
id. We will add this to the next release.

4. ... there may be more ways, but these will be enough to solve your
problem.
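As a rough illustration of the idea in option 3 (keep a table of IDs and reject any sequence whose ID has already been seen), here is a minimal standalone Python sketch. It is not EMBOSS code and not the application mentioned above; the function name dedup_fasta is hypothetical, and it assumes plain FASTA input where the ID is the first word after '>'.

```python
def dedup_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each ID.

    Hypothetical helper for illustration; `lines` is any iterable of text
    lines (for example an open file).
    """
    seen = set()   # IDs already written
    keep = False   # whether the current record is being kept
    for line in lines:
        if line.startswith(">"):
            # The ID is taken to be the first whitespace-delimited token
            # after '>'.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            yield line
```

It could be used along the lines of `with open("Multi.fasta") as fh: out.writelines(dedup_fasta(fh))`, where `out` is an output file opened for writing. Unlike option 2, this keeps the first occurrence of each ID rather than the last.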
Hope that helps,

Peter Rice
EMBOSS Team

From fermaral1981 at gmail.com  Wed Oct  5 06:52:43 2011
From: fermaral1981 at gmail.com (Fernando Martínez-Alberola)
Date: Wed, 05 Oct 2011 12:52:43 +0200
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To: <4E8B1481.9060305@ebi.ac.uk>
References: <4E8B1481.9060305@ebi.ac.uk>
Message-ID: <1317811963.14315.1016.camel@cladonia2-desktop>

On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote:
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-fasta file where there are
> > identical sequences and I want to extract only the ones in my list. How can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> > >seq1
> > atataga...
> > >seq2
> > ttatggttca..
> > [...]
> > >seq1
> > atataga...
> > [...]
> > And I only want to take seq1 and seq2, not two times seq1!!
>
> If you really must start from that file .... as usual with EMBOSS there
> are several ways to do it
>
> 1. Index with dbifasta
> ----------------------
>
> You can index with the older dbifasta program. This does not allow
> duplicate IDs so only one seq1 will be indexed.
>
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto
>
> Then define a database in your .embossrc file:
>
> DB multi [
>   format: "fasta"
>   method: "emblcd"
>   type: "nucleotide"
>   directory: "."
> ]
>
> Then replace "Multi.fasta" in your listfile with "multi" and you will
> have the sequences you want.
>
>
>
> 2. Rewrite as single files in a new directory, then rewrite as one file
>
> % mkdir multi
> % seqret -ossingle -osdir multi Multi.fasta -auto
> % ls multi
> seq1.fasta seq2.fasta ...
>
> % cd multi
> % seqret '*.fasta' ../Single.fasta
>
> (note: you do need the quotes around the wild card file name)
>
> This will give you a file Single.fasta in the original directory with
> only the last version of each id.
>
>
>
> 3. Write a new application
> ---------------------------
>
> Another approach is to write your own new application. A copy of seqret
> which keeps a table of ids and rejects any sequence with a known ID will
> rewrite the file (in any format) with only the first occurrence of each
> id. We will add this to the next release.
>
>
> 4. ... there may be more ways, but these will be enough to solve your
> problem.
>
> Hope that helps,
>
> Peter Rice
> EMBOSS Team

Thanks, your help was very useful, in particular the second method.

Best regards, Fernando

From pmr at ebi.ac.uk  Wed Oct  5 08:26:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Oct 2011 13:26:05 +0100
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E8C4CDD.9010500@ebi.ac.uk>

On 09/30/2011 07:22 PM, Ed Siefker wrote:
> Is there a way to access NCBI Genbank remotely?

The SRS server at DKFZ is defined as a server in EMBOSS 6.4.0.0 so you
can use it with no extra definition:

seqret dkfz:genbank:x13666

You can also use query fields, for example:

seqret dkfz:genbank-id:x13776
seqret dkfz:genbank-acc:x13776
seqret 'dkfz:genbank-des:{amic & amir}'

The release should also support the NCBI Entrez server but there is a
bug in parsing the header. I will add a fix to the next patch. Then you
could also use entrez:nucleotide:x13776 which reads the genbank format
of the entry.

Hope this helps,

Peter Rice
EMBOSS Team

From ajb at ebi.ac.uk  Wed Oct  5 11:05:30 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Oct 2011 16:05:30 +0100 (BST)
Subject: [EMBOSS] EMBOSS patch set 1-24 available. New mEMBOSS available.
Message-ID: <39217.82.26.12.214.1317827130.squirrel@imap04.ebi.ac.uk>

New bug-fix files are available for EMBOSS-6.4.0 and, for Windows users,
a new version of mEMBOSS is available. The bugs fixed include those
recently fixed (22-24), listed below, and all those fixed by previous
patches (1-21).

1) UNIX

As usual, the most convenient way of applying the bug-fixes is to apply
the patch file:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-24.gz

to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code and
then recompile and install (see
ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch
for instructions on using 'patch').

Alternatively, you can individually copy the patched files from the
ftp://emboss.open-bio.org/pub/EMBOSS/fixes/
directory if your system does not support 'patch'.

2) mEMBOSS

The new version incorporates all new and previous bug-fixes.
Uninstall your previous mEMBOSS installation and download and
install the new setup file from:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.4-setup.exe

Alan

-----------------------------------------------------------------------

Fix 22. EMBOSS-6.4.0/emboss/diffseq.c
        EMBOSS-6.4.0/ajax/core/ajreport.c

14-Sep-2011: Diffseq reports insertions in the second sequence with a
length 2 reversed region in the first sequence instead of a length 0
empty sequence. This bug was introduced in release 6.0.0 when reversed
sequence features were updated.

Fix 23. EMBOSS-6.4.0/ajax/core/ajindex.c

04-Oct-2011: Dbx index files from earlier releases do not include a
type parameter to indicate an Identifier or Secondary index. The code
to test index field names failed to define id and acc fields as
Identifiers. This fix allows old indexes to work with EMBOSS 6.4.0.

Fix 24. EMBOSS-6.4.0/ajax/core/ajfileio.c

05-Oct-2011: Trimming carriage controls from the ends of lines in a
buffer failed when MacOSX-style characters are used and the line
buffer is a reference counted string. An example on non-MacOSX systems
was processing the data returned by the NCBI Entrez server.

From aengus.stewart at cancer.org.uk  Wed Oct 12 11:50:36 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 12 Oct 2011 16:50:36 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E95B74C.10107@cancer.org.uk>

Hi Folks,

I couldn't see a command line option to do what I wanted, i.e. return
non-overlapping hits.

This is best explained with some sample output.

#=======================================
#
# Sequence: chr1_174353258_174354335     from: 1   to: 200
# HitCount: 9
#
# Pattern_name Mismatch Pattern
# pattern1            3 CC[AT](6)GG
#
# Complement: No
#
#=======================================

  Start     End  Strand Pattern_name Mismatch Sequence
     54      63       + pattern1            3 GCCAAATAAG
     55      64       + pattern1            . CCAAATAAGG
     56      65       + pattern1            2 CAAATAAGGG
    104     113       + pattern1            1 CCTAAATAAG
    105     114       + pattern1            1 CTAAATAAGG
    106     115       + pattern1            3 TAAATAAGGG
    179     188       + pattern1            2 CCTTGCTTGG
    190     199       + pattern1            3 CCGATTAGAG
    191     200       + pattern1            3 CGATTAGAGC

As you can see this is actually only 4 hits rather than the 9 reported.

I can do this myself with another script but I was wondering if it could
be an option?

regards

Aengus

--
-----------------------------------------------------------------------
Aengus Stewart                              Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------
This electronic message contains information which may be privileged and
confidential. The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named
person(s). If you are not the intended recipient, notify the sender
immediately, delete this email from your system and do not disclose or
use for any purpose.

We may monitor all incoming and outgoing emails in line with current
legislation.
We have taken steps to ensure that this email and attachments
are free from any virus, but it remains your responsibility to ensure
that viruses do not adversely affect you.

Cancer Research UK
Registered in England and Wales
Company Registered Number: 4325234.
Registered Charity Number: 1089464 and Scotland SC041666
Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.

From pmr at ebi.ac.uk  Wed Oct 12 20:02:08 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 01:02:08 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E95B74C.10107@cancer.org.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
Message-ID: <4E962A80.1070907@ebi.ac.uk>

On 12/10/2011 16:50, Aengus Stewart wrote:
> Hi Folks,
>
> I couldn't see a command line option to do what I wanted, i.e. return
> non-overlapping hits.
>
> This is best explained with some sample output.
>
> #=======================================
> #
> # Sequence: chr1_174353258_174354335     from: 1   to: 200
> # HitCount: 9
> #
> # Pattern_name Mismatch Pattern
> # pattern1            3 CC[AT](6)GG
>
> As you can see this is actually only 4 hits rather than the 9 reported.

Hmmm ... with that kind of pattern and 3 mismatches there are pretty
sure to be overlapping matches.

Trouble is, which matches would you want to keep? Your second match, for
example, has 2 hits with 1 mismatch at 104..113 and 105..114

It should be possible to come up with patterns where the choice of 'best
hit' complicates which hits are considered to overlap.

Probably writing a script is your best bet as you can then control which
hits are picked.

We could try to write an application to remove overlapping features ...
if someone can define how to select them. In this case, the mismatch
number will be stored as a tag (feature qualifier) in the feature table
and could be included in the selection criteria.

Hope this helps ...
and maybe sparks some ideas

Peter Rice
EMBOSS Team

From jison at ebi.ac.uk  Thu Oct 13 03:45:58 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 13 Oct 2011 08:45:58 +0100 (BST)
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E962A80.1070907@ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
 <4E962A80.1070907@ebi.ac.uk>
Message-ID: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>

Hi chaps (Aengus !)

If I understood Aengus' message, what's needed is something that simply
combines overlapping hits (for a given pattern) into one or more
non-overlapping "regions of hits", and reports those regions, e.g.

  Start     End  Strand Pattern_name Mismatch Sequence
     54      65       + pattern1            5 GCCAAATAAGGG
    104     115       + pattern1            5 CCTAAATAAGGG
    179     188       + pattern1            2 CCTTGCTTGG
    190     200       + pattern1            6 CCGATTAGAGC

Mismatch in this case is reporting the sum of mismatches from before. A
column for the number of (sub)matches would also be needed. Is that right
Aengus?

The above might give a useful result depending on the input pattern. It
would I think be easy enough to implement.

Cheers

Jon

> On 12/10/2011 16:50, Aengus Stewart wrote:
>> Hi Folks,
>>
>> I couldn't see a command line option to do what I wanted, i.e. return
>> non-overlapping hits.
>>
>> This is best explained with some sample output.
>>
>> #=======================================
>> #
>> # Sequence: chr1_174353258_174354335     from: 1   to: 200
>> # HitCount: 9
>> #
>> # Pattern_name Mismatch Pattern
>> # pattern1            3 CC[AT](6)GG
>>
>> As you can see this is actually only 4 hits rather than the 9 reported.
>
> Hmmm ... with that kind of pattern and 3 mismatches there are pretty
> sure to be overlapping matches.
>
> Trouble is, which matches would you want to keep? Your second match, for
> example, has 2 hits with 1 mismatch at 104..113 and 105..114
>
> It should be possible to come up with patterns where the choice of 'best
> hit' complicates which hits are considered to overlap.
>
> Probably writing a script is your best bet as you can then control which
> hits are picked.
>
> We could try to write an application to remove overlapping features ...
> if someone can define how to select them. In this case, the mismatch
> number will be stored as a tag (feature qualifier) in the feature table
> and could be included in the selection criteria.
>
> Hope this helps ... and maybe sparks some ideas
>
> Peter Rice
> EMBOSS Team
>

From pmr at ebi.ac.uk  Thu Oct 13 04:44:33 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 09:44:33 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
 <4E962A80.1070907@ebi.ac.uk>
 <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
Message-ID: <4E96A4F1.4050303@ebi.ac.uk>

On 13/10/2011 08:45, Jon Ison wrote:
> Hi chaps (Aengus !)
>
> If I understood Aengus' message, what's needed is something that simply
> combines overlapping hits (for a given pattern) into one or more
> non-overlapping "regions of hits", and reports those regions, e.g.
>
>   Start     End  Strand Pattern_name Mismatch Sequence
>      54      65       + pattern1            5 GCCAAATAAGGG
>     104     115       + pattern1            5 CCTAAATAAGGG
>     179     188       + pattern1            2 CCTTGCTTGG
>     190     200       + pattern1            6 CCGATTAGAGC
>
> Mismatch in this case is reporting the sum of mismatches from before. A
> column for the number of (sub)matches would also be needed. Is that right
> Aengus?

I'm not sure that adding the mismatches is sound. I'd assume just a best
hit from the overlapping matches.

> The above might give a useful result depending on the input pattern. It
> would I think be easy enough to implement.
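
For anyone post-processing the report with a script, the "best hit per overlapping group" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not fuzznuc behaviour: the function name best_nonoverlapping and the (start, end, mismatches) tuple layout are hypothetical, a '.' in the Mismatch column is taken to mean 0, and ties are broken by taking the leftmost hit.

```python
def best_nonoverlapping(hits):
    """Return one hit per group of overlapping hits.

    `hits` is an iterable of (start, end, mismatches) tuples with
    inclusive coordinates. Hits are chained into a group while each
    one starts at or before the current group's end; the hit with the
    fewest mismatches is kept, ties going to the leftmost hit.
    """
    groups = []
    for hit in sorted(hits):
        start, end, _ = hit
        if groups and start <= groups[-1]["end"]:
            # Overlaps the current group: extend it.
            groups[-1]["hits"].append(hit)
            groups[-1]["end"] = max(groups[-1]["end"], end)
        else:
            # No overlap: start a new group.
            groups.append({"end": end, "hits": [hit]})
    return [min(g["hits"], key=lambda h: (h[2], h[0])) for g in groups]
```

Applied to the nine hits in the sample report, this keeps 55..64, 104..113, 179..188 and 190..199. Note that chaining can join a long run of partially overlapping hits into one group, which is exactly the kind of case where the choice of selection rule matters.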

This is a report output, so post-processing could be done by trimming
the results before output using an associated qualifier.

Still not sure how useful it would be; we need more feedback from other
users on this one please!

Peter Rice
EMBOSS Team

From aengus.stewart at cancer.org.uk  Thu Oct 13 05:31:56 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Thu, 13 Oct 2011 10:31:56 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E96A4F1.4050303@ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
 <4E962A80.1070907@ebi.ac.uk>
 <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
 <4E96A4F1.4050303@ebi.ac.uk>
Message-ID: <4E96B00C.80806@cancer.org.uk>

So Peter is right about what I want returned: the best match. But of
course he has pointed out the problem with having 2 best matches for the
same region (in this example 104-113, 105-114).

However, it is still the case that the "real" result is 4 hits rather
than 9.

I don't know if my example is a special case or not, so it would be good,
as Peter suggests, to hear whether someone else has used fuzznuc in a
similar way.

Though surely if you include any mismatch at all for your pattern search
then you automatically have this scenario of returning multiple results
for the same location?

Cheers

Aengus

On 13/10/11 09:44, Peter Rice wrote:
> On 13/10/2011 08:45, Jon Ison wrote:
>> Hi chaps (Aengus !)
>>
>> If I understood Aengus' message, what's needed is something that simply
>> combines overlapping hits (for a given pattern) into one or more
>> non-overlapping "regions of hits", and reports those regions, e.g.
>>
>>   Start     End  Strand Pattern_name Mismatch Sequence
>>      54      65       + pattern1            5 GCCAAATAAGGG
>>     104     115       + pattern1            5 CCTAAATAAGGG
>>     179     188       + pattern1            2 CCTTGCTTGG
>>     190     200       + pattern1            6 CCGATTAGAGC
>>
>> Mismatch in this case is reporting the sum of mismatches from before. A
>> column for the number of (sub)matches would also be needed. Is that right
>> Aengus?
>
> I'm not sure that adding the mismatches is sound. I'd assume just a best
> hit from the overlapping matches.
>
>> The above might give a useful result depending on the input pattern. It
>> would I think be easy enough to implement.
>
> This is a report output, so post-processing could be done by trimming
> the results before output using an associated qualifier.
>
> Still not sure how useful it would be; we need more feedback from other
> users on this one please!
>
> Peter Rice
> EMBOSS Team
>

--
-----------------------------------------------------------------------
Aengus Stewart                              Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------

From peter.r.hoyt at okstate.edu  Thu Oct 20 12:29:47 2011
From: peter.r.hoyt at okstate.edu (peter.r.hoyt at okstate.edu)
Date: Thu, 20 Oct 2011 11:29:47 -0500
Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5
Message-ID: <4EA04C7B.70008@okstate.edu>

So I upgraded mEMBOSS (which I've been using for a while) to 6.4.0.4.
In my previous installs, I had used CygWin, but this time I could NOT
get the CygWin install to work (I really tried!). So I settled for the
Windows setup file. Now I have jEMBOSS running fine, but it still says
version 1.5. Is this correct? The jEMBOSS version hasn't changed?

My next question coming soon!

Pete

From ajb at ebi.ac.uk  Mon Oct 24 08:56:34 2011
From: ajb at ebi.ac.uk (Alan Bleasby)
Date: Mon, 24 Oct 2011 13:56:34 +0100
Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5
In-Reply-To: <4EA04C7B.70008@okstate.edu>
References: <4EA04C7B.70008@okstate.edu>
Message-ID: <4EA56082.1080307@ebi.ac.uk>

Hello Pete,

This one seems to have remained unanswered. Yes, the Jemboss version
is still 1.5. The GUI has continued to be updated but the version number
has remained the same for quite a while (an oversight on our part,
thanks for highlighting it).

Of course, to show the version of EMBOSS itself, you use the
'embossversion' application, which should show 6.4.0.4, within mEMBOSS,
for the version you've installed.

HTH

Alan

On 10/20/2011 05:29 PM, peter.r.hoyt at okstate.edu wrote:
> So I upgraded mEMBOSS (which I've been using for a while) to 6.4.0.4.
> In my previous installs, I had used CygWin, but this time I could NOT
> get the CygWin install to work (I really tried!). So I settled for the
> Windows setup file. Now I have jEMBOSS running fine, but it still says
> version 1.5. Is this correct? The jEMBOSS version hasn't changed?
>
> My next question coming soon!
>
> Pete

From bernd.web at gmail.com  Fri Oct 28 13:03:15 2011
From: bernd.web at gmail.com (Bernd Web)
Date: Fri, 28 Oct 2011 19:03:15 +0200
Subject: [EMBOSS] fuzznuc pattern expansion
Message-ID:

Hi

Using fuzznuc I get illegal pattern warnings. I realize what is going on:

"You can use ambiguity codes for nucleic acid searches but not within
[] or {} as they expand to bracketed counterparts. For example, "s" is
expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
illegal."

However, what I cannot find is how to suppress this expansion. Is this
possible? We actually need to have these ambiguity codes remain as they
are within [], as the input sequences can themselves contain R, Y, B, N,
for example. Thus, [GCS] is a pattern we actually want to be able to use.

Kind regards,
Bernd

From pmr at ebi.ac.uk  Sat Oct 29 13:06:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 29 Oct 2011 18:06:13 +0100
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To:
References:
Message-ID: <4EAC3285.7080501@ebi.ac.uk>

On 28/10/2011 18:03, Bernd Web wrote:
> Hi
>
> Using fuzznuc I get illegal pattern warnings. I realize what is going on:
>
> "You can use ambiguity codes for nucleic acid searches but not within
> [] or {} as they expand to bracketed counterparts. For example, "s" is
> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
> illegal."
>
> However, what I cannot find is how to suppress this expansion. Is this
> possible? We actually need to have these ambiguity codes remain as they
> are within [], as the input sequences can themselves contain R, Y, B, N,
> for example. Thus, [GCS] is a pattern we actually want to be able to use.

That looks a reasonable suggestion. We can replace S with [GCS] directly.
For the wider ambiguity codes, we can replace them with the subsets: B [TGCBSYK] D [TGADWRK] H [TCAHWYM] V [GCAVSRM] We can also allow 'C\S' to explicitly match CS in the input sequence by escaping the S to skip the automatic expansion. These changes can be added to the next release. Thanks for the idea. Peter Rice EMBOSS Team From georgios at biotek.uio.no Sat Oct 1 09:37:50 2011 From: georgios at biotek.uio.no (Georgios Magklaras) Date: Sat, 01 Oct 2011 11:37:50 +0200 Subject: [EMBOSS] Remote Genbank from NCBI? In-Reply-To: <4E8608C8.3090502@creighton.edu> References: <4E8608C8.3090502@creighton.edu> Message-ID: <4E86DF6E.6090505@biotek.uio.no> On 09/30/2011 08:22 PM, Ed Siefker wrote: > Is there a way to access NCBI Genbank remotely? > My emboss.default contains the following: > > DB tgb [ type: N method: srswww format: genbank > url: "http://cbr-rbc.nrc-cnrc.gc.ca/srs6bin/cgi-bin/wgetz" > dbalias: genbankrelease > fields: "sv des org key" > comment: "Genbank IDs" ] > > > However that server does not exist. I've looked on > the NCBI website for alternatives, but all I can find > is the ftp site. I've also read the EMBOSS admin guide. > The examples there use infobiogen.fr, which is also > closed. > > So what do people do for genbank access? I'd prefer > to avoid setting up a local database myself if I can. > Is there a list of genbank mirrors around somewhere? > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss Hi Ed, Yes, that SRS server does not exist anymore. The EBI SRS server is there and updated regurarly, but I am not sure if it offers Genbank. 
It does offer a full version of EMBL (nucleotide database, the contents of the release should mirror sync those of Genbank), so if you type the following in your emboss.default file, you will connect: DB embl [ type: N method: srswww format: embl url: "http://srs.ebi.ac.uk/cgi-bin/wgetz" fields: "id sv des org key" comment: "EMBL" ] Best regards, GM -- -- George Magklaras PhD RHCE no: 805008309135525 Senior Systems Engineer/IT Manager Biotek Center, University of Oslo EMBnet TMPC Chair http://folk.uio.no/georgios Tel: +47 22840535 From hrh at fmi.ch Sat Oct 1 11:15:41 2011 From: hrh at fmi.ch (Hans-Rudolf Hotz) Date: Sat, 01 Oct 2011 13:15:41 +0200 Subject: [EMBOSS] Remote Genbank from NCBI? In-Reply-To: <4E86DF6E.6090505@biotek.uio.no> References: <4E8608C8.3090502@creighton.edu> <4E86DF6E.6090505@biotek.uio.no> Message-ID: <4E86F65D.1090803@fmi.ch> On 10/01/2011 11:37 AM, Georgios Magklaras wrote: > On 09/30/2011 08:22 PM, Ed Siefker wrote: >> Is there a way to access NCBI Genbank remotely? >> My emboss.default contains the following: >> >> DB tgb [ type: N method: srswww format: genbank >> url: "http://cbr-rbc.nrc-cnrc.gc.ca/srs6bin/cgi-bin/wgetz" >> dbalias: genbankrelease >> fields: "sv des org key" >> comment: "Genbank IDs" ] >> >> >> However that server does not exist. I've looked on >> the NCBI website for alternatives, but all I can find >> is the ftp site. I've also read the EMBOSS admin guide. >> The examples there use infobiogen.fr, which is also >> closed. >> >> So what do people do for genbank access? I'd prefer >> to avoid setting up a local database myself if I can. >> Is there a list of genbank mirrors around somewhere? >> >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss > Hi Ed, > > Yes, that SRS server does not exist anymore. The EBI SRS server is there > and updated regurarly, but I am not sure if it offers Genbank. 
It does > offer a full version of EMBL (nucleotide database, the contents of the > release should mirror sync those of Genbank), so if you type the > following in your emboss.default file, you will connect: > Try the SRS server at the 'DKFZ', see: http://www.dkfz.de/menu/cgi-bin/srs7.1.3.1/wgetz?-page+databanks or check the list of Public SRS Installations, see: http://www.biowisdom.com/download/srs-parser-and-software-downloads/public-srs-installations/ (although, I am not sure whether this list is actually still maintained) Regards, Hans > DB embl [ type: N method: srswww format: embl > url: "http://srs.ebi.ac.uk/cgi-bin/wgetz" > fields: "id sv des org key" > comment: "EMBL" ] > > Best regards, > GM > From fermaral1981 at gmail.com Tue Oct 4 13:38:22 2011 From: fermaral1981 at gmail.com (Fernando Martinez) Date: Tue, 4 Oct 2011 15:38:22 +0200 Subject: [EMBOSS] uniq sequences on a list Message-ID: Hi, I am trying to retrieve sequences from a multi-fasta file were there are identical sequences and i want to extract only the ones in my list, how can I do that? Example: Multi.fasta file: >seq1 atataga... >seq2 ttatggttca.. [...] >seq1 atataga... [...] and my list is: Multi.fasta:seq1 Multi.fasta:seq2 When I run "seqret @list -out out.fasta" I retrieve : >seq1 atataga... >seq2 ttatggttca... >seq1 atataga... And I only want to take seq1 an seq2, not two times seq1!! thanks From pmr at ebi.ac.uk Tue Oct 4 14:13:21 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 04 Oct 2011 15:13:21 +0100 Subject: [EMBOSS] uniq sequences on a list In-Reply-To: References: Message-ID: <4E8B1481.9060305@ebi.ac.uk> On 10/04/2011 02:38 PM, Fernando Martinez wrote: > Hi, I am trying to retrieve sequences from a multi-fasta file were there are > identical sequences and i want to extract only the ones in my list, how can > I do that? > Example: > > Multi.fasta file: > >> seq1 > atataga... >> seq2 > ttatggttca.. > [...] >> seq1 > atataga... > [...] 
> And I only want to take seq1 an seq2, not two times seq1!!

If you really must start from that file .... as usual with EMBOSS there are several ways to do it

1. Index with dbifasta
----------------------

You can index with the older dbifasta program. This does not allow duplicate IDs so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto

Then define a database in your .embossrc file:

DB multi [
  format: "fasta"
  method: "emblcd"
  type: "nucleotide"
  directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will have the sequences you want.

2. Rewrite as single files in a new directory, then rewrite as one file
-----------------------------------------------------------------------

% mkdir multi
% seqret -ossingle -osdirectory multi Multi.fasta -auto
% ls multi
seq1.fasta seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(note: you do need the quotes around the wild card file name)

This will give you a file Single.fasta in the original directory with only the last version of each id.

3. Write a new application
---------------------------

Another approach is to write your own new application. A copy of seqret which keeps a table of ids and rejects any sequence with known ID will rewrite the file (in any format) with only the first occurrence of each id. We will add this to the next release.

4. ... there may be more ways, but these will be enough to solve your problem.
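The idea behind option 3 (keep a table of ids and reject any id already seen) can also be tried outside EMBOSS. What follows is only a minimal sketch in Python, not the planned EMBOSS application: it assumes plain FASTA input in which the id is the first word after '>', and the function name dedup_fasta is hypothetical. Unlike option 2, it keeps the first occurrence of each id rather than the last.

```python
def dedup_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each id."""
    seen = set()   # table of ids already written
    keep = False   # whether the current record is being kept
    for line in lines:
        if line.startswith(">"):
            # Assumption: the id is the first word after '>'.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            yield line
```

For example, sys.stdout.writelines(dedup_fasta(open("Multi.fasta"))) would print the de-duplicated records, which can then be redirected to a new file.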
Hope that helps,

Peter Rice
EMBOSS Team

From fermaral1981 at gmail.com Wed Oct 5 10:52:43 2011
From: fermaral1981 at gmail.com (Fernando Martínez-Alberola)
Date: Wed, 05 Oct 2011 12:52:43 +0200
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To: <4E8B1481.9060305@ebi.ac.uk>
References: <4E8B1481.9060305@ebi.ac.uk>
Message-ID: <1317811963.14315.1016.camel@cladonia2-desktop>

On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote:
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-fasta file were there are
> > identical sequences and i want to extract only the ones in my list, how can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> >> seq1
> > atataga...
> >> seq2
> > ttatggttca..
> > [...]
> >> seq1
> > atataga...
> > [...]
> > And I only want to take seq1 an seq2, not two times seq1!!
>
> If you really must start from that file .... as usual with EMBOSS there
> are several ways to do it
>
> 1. Index with dbifasta
> ----------------------
>
> You can index with the older dbifasta program. This does not allow
> duplicate IDs so only one seq1 will be indexed.
>
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat
> simple -auto
>
> Then define a database in your .embossrc file:
>
> DB multi [
> format: "fasta"
> method: "emblcd"
> type: "nucleotide"
> directory: "."
> ]
>
> Then replace "Multi.fasta" in your listfile with "multi" and you will
> have the sequences you want.
>
>
>
> 2. rewrite as single files in a new directory, then rewrite as one file
>
> % mkdir multi
> % seqret -ossingle -osdirectory multi Multi.fasta -auto
> % ls multi
> seq1.fasta seq2.fasta ...
>
> % cd multi
> % seqret '*.fasta' ../Single.fasta
>
> (note: you do need the quotes around the wild card file name)
>
> this will give you a file Single.fasta in the original directory with
> only the last version of each id.
>
>
>
> 3.
Write a new application
> ---------------------------
>
> Another approach is to write your own new application. A copy of seqret
> which keeps a table of ids and rejects any sequence with known ID will
> rewrite the file (in any format) with only the first occurrence of each
> id. We will add this to the next release.
>
>
> 4. ... there may be more ways, but these will be enough to solve your
> problem.
>
> Hope that helps,
>
> Peter Rice
> EMBOSS Team

Thanks, your help was very useful, in particular the second method.

Best regards,
Fernando

From pmr at ebi.ac.uk Wed Oct 5 12:26:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Oct 2011 13:26:05 +0100
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E8C4CDD.9010500@ebi.ac.uk>

On 09/30/2011 07:22 PM, Ed Siefker wrote:
> Is there a way to access NCBI Genbank remotely?

The SRS server at DKFZ is defined as a server in EMBOSS 6.4.0.0 so you can use it with no extra definition:

seqret dkfz:genbank:x13776

You can also use query fields, for example:

seqret dkfz:genbank-id:x13776
seqret dkfz:genbank-acc:x13776
seqret 'dkfz:genbank-des:{amic & amir}'

The release should also support the NCBI Entrez server but there is a bug in parsing the header. I will add a fix to the next patch. Then you could also use

entrez:nucleotide:x13776

which reads the genbank format of the entry.

Hope this helps,

Peter Rice
EMBOSS Team

From ajb at ebi.ac.uk Wed Oct 5 15:05:30 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Oct 2011 16:05:30 +0100 (BST)
Subject: [EMBOSS] EMBOSS patch set 1-24 available. New mEMBOSS available.
Message-ID: <39217.82.26.12.214.1317827130.squirrel@imap04.ebi.ac.uk>

New bug-fix files are available for EMBOSS-6.4.0 and, for Windows users, a new version of mEMBOSS is available. The bugs fixed include those recently fixed (22-24), listed below, and all those fixed by previous patches (1-21).
1) UNIX As usual, the most convenient way of applying the bug-fixes is to apply the patch file: ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-24.gz to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code and recompiling/installing. (see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch for instructions on using 'patch'). Alternatively, you can individually copy the patched files from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory if your system does not support 'patch'. 2) mEMBOSS The new version incorporates all new and previous bug-fixes. Uninstall your previous mEMBOSS installation and download and install the new setup file from: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.4-setup.exe Alan ----------------------------------------------------------------------- Fix 22. EMBOSS-6.4.0/emboss/diffseq.c EMBOSS-6.4.0/ajax/core/ajreport.c 14-Sep-2011: Diffseq reports insertions in the second sequence with a length 2 reversed region in the first sequence instead of a length 0 empty sequence. This bug was introduced in release 6.0.0 when reversed sequence features were updated. Fix 23. EMBOSS-6.4.0/ajax/core/ajindex.c 04-Oct-2011: Dbx index files from earlier releases do not include a type parameter to indicate an Identifier or Secondary index. The code to test index field names failed to define id and acc fields as Identifiers. This fix allows old indexes to work with EMBOSS 6.4.0. Fix 24. EMBOSS-6.4.0/ajax/core/ajfileio.c 05-Oct-2011: Trimming carriage controls from the ends of lines in a buffer failed when MacOSX-style characters are used and the line buffer is a reference counted string. An example on non-MacOSX systems was processing the data returned by the NCBI Entrez server. From aengus.stewart at cancer.org.uk Wed Oct 12 15:50:36 2011 From: aengus.stewart at cancer.org.uk (Aengus Stewart) Date: Wed, 12 Oct 2011 16:50:36 +0100 Subject: [EMBOSS] non-overlapping matches in fuzznuc? 
In-Reply-To: <4E8608C8.3090502@creighton.edu> References: <4E8608C8.3090502@creighton.edu> Message-ID: <4E95B74C.10107@cancer.org.uk> Hi Folks, I couldnt see a command line option to do what I wanted ie return non-overlapping hits. This is best explained with some sample output. #======================================= # # Sequence: chr1_174353258_174354335 from: 1 to: 200 # HitCount: 9 # # Pattern_name Mismatch Pattern # pattern1 3 CC[AT](6)GG # # Complement: No # #======================================= Start End Strand Pattern_name Mismatch Sequence 54 63 + pattern1 3 GCCAAATAAG 55 64 + pattern1 . CCAAATAAGG 56 65 + pattern1 2 CAAATAAGGG 104 113 + pattern1 1 CCTAAATAAG 105 114 + pattern1 1 CTAAATAAGG 106 115 + pattern1 3 TAAATAAGGG 179 188 + pattern1 2 CCTTGCTTGG 190 199 + pattern1 3 CCGATTAGAG 191 200 + pattern1 3 CGATTAGAGC As you can see this is actually only 4 hits rather than the 9 reported. I can do this myself with another script but I was wondering if it could be an option? regards Aengus -- ----------------------------------------------------------------------- Aengus Stewart Tel: +44 (0)20 7269 3679 Head of Bioinformatics and BioStatistics CRUK London Research Institute Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK ----------------------------------------------------------------------- This electronic message contains information which may be privileged and confidential. The information is intended to be for the use of the individual(s) or entity named above. Be aware that any third party disclosure, distribution, copying or use of this communication, without prior permission, is strictly prohibited. NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. We may monitor all incoming and outgoing emails in line with current legislation. 
We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you.

Cancer Research UK
Registered in England and Wales
Company Registered Number: 4325234.
Registered Charity Number: 1089464 and Scotland SC041666
Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.

From pmr at ebi.ac.uk Thu Oct 13 00:02:08 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 01:02:08 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E95B74C.10107@cancer.org.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
Message-ID: <4E962A80.1070907@ebi.ac.uk>

On 12/10/2011 16:50, Aengus Stewart wrote:
> Hi Folks,
>
> I couldnt see a command line option to do what I wanted ie return
> non-overlapping hits.
>
> This is best explained with some sample output.
>
> #=======================================
> #
> # Sequence: chr1_174353258_174354335 from: 1 to: 200
> # HitCount: 9
> #
> # Pattern_name Mismatch Pattern
> # pattern1 3 CC[AT](6)GG
>
> As you can see this is actually only 4 hits rather than the 9 reported.

Hmmm ... with that kind of pattern and 3 mismatches there are pretty sure to be overlapping matches.

Trouble is, which matches would you want to keep? Your second match, for example, has 2 hits with 1 mismatch at 104..113 and 105..114

It should be possible to come up with patterns where the choice of 'best hit' complicates which hits are considered to overlap.

Probably writing a script is your best bet as you can then control which hits are picked.

We could try to write an application to remove overlapping features ... if someone can define how to select them. In this case, the mismatch number will be stored as a tag (feature qualifier) in the feature table and could be included in the selection criteria.

Hope this helps ...
and maybe sparks some ideas Peter Rice EMBOSS Team From jison at ebi.ac.uk Thu Oct 13 07:45:58 2011 From: jison at ebi.ac.uk (Jon Ison) Date: Thu, 13 Oct 2011 08:45:58 +0100 (BST) Subject: [EMBOSS] non-overlapping matches in fuzznuc? In-Reply-To: <4E962A80.1070907@ebi.ac.uk> References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk> <4E962A80.1070907@ebi.ac.uk> Message-ID: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk> Hi chaps (Aengus !) If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g. Start End Strand Pattern_name Mismatch Sequence 54 65 + pattern1 5 GCCAAATAAGGG 104 115 + pattern1 5 CCTAAATAAGGG 179 188 + pattern1 2 CCTTGCTTGG 190 200 + pattern1 6 CCGATTAGAGC Mismatch in this case is reporting the sum of mismatches from before. A column for number of (sub)matches would also be needed. Is that right Aengus? The above might give a useful result depending in the input pattern. It would I think be easy enough to implement. Cheers Jon > On 12/10/2011 16:50, Aengus Stewart wrote: >> Hi Folks, >> >> I couldnt see a command line option to do what I wanted ie return >> non-overlapping hits. >> >> This is best explained with some sample output. >> >> #======================================= >> # >> # Sequence: chr1_174353258_174354335 from: 1 to: 200 >> # HitCount: 9 >> # >> # Pattern_name Mismatch Pattern >> # pattern1 3 CC[AT](6)GG >> >> As you can see this is actually only 4 hits rather than the 9 reported. > > Hmmm ... with that kind of pattern and 3 mismatches there are pretty > sure to be overlapping matches. > > Trouble is, which matches would you want to keep? Your second match, for > example, has 2 hits with 1 mismatch at 104..115 and 105..116 > > It should be possible to come up with patterns where the choice of 'best > hit' complicates which hits are considered to overlap. 
> > Probably writing a script is your best bet as you can then control which > hits are picked. > > We could try to write an application to remove overlapping features ... > if someone can define how to select them. In this case, the mismatch > number will be stored as a tag (feature qualifier) in the feature table > and could be included in the selection criteria. > > Hope this helps ... and maybe sparks some ideas > > Peter Rice > EMBOSS Team > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From pmr at ebi.ac.uk Thu Oct 13 08:44:33 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 13 Oct 2011 09:44:33 +0100 Subject: [EMBOSS] non-overlapping matches in fuzznuc? In-Reply-To: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk> References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk> <4E962A80.1070907@ebi.ac.uk> <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk> Message-ID: <4E96A4F1.4050303@ebi.ac.uk> On 13/10/2011 08:45, Jon Ison wrote: > Hi chaps (Aengus !) > > If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for > a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g. > > Start End Strand Pattern_name Mismatch Sequence > 54 65 + pattern1 5 GCCAAATAAGGG > 104 115 + pattern1 5 CCTAAATAAGGG > 179 188 + pattern1 2 CCTTGCTTGG > 190 200 + pattern1 6 CCGATTAGAGC > > Mismatch in this case is reporting the sum of mismatches from before. A column for number of > (sub)matches would also be needed. Is that right Aengus? I'm not sure that adding the mismatches is sound. I'd assume just a best hit from the overlapping matches. > The above might give a useful result depending in the input pattern. It would I think be easy > enough to implement. 
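Both post-processing ideas raised in this thread (merging overlapping hits into regions, or keeping a single best hit per region) can be tried on a fuzznuc report with a short script. This is only a minimal sketch in Python, not an EMBOSS feature: it assumes the hits have already been parsed into (start, end, mismatches) tuples with '.' read as 0, the function names are hypothetical, and the tie-break between equally good hits (lowest start wins) is an arbitrary choice.

```python
def group_overlaps(hits):
    """Group (start, end, mismatches) hits into runs of chained overlaps."""
    groups = []
    for hit in sorted(hits):
        # A hit joins the current group if it starts inside the region so far.
        if groups and hit[0] <= max(h[1] for h in groups[-1]):
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def merged_regions(hits):
    """One (start, end, number_of_hits) tuple per group of overlapping hits."""
    return [(g[0][0], max(h[1] for h in g), len(g)) for g in group_overlaps(hits)]


def best_hits(hits):
    """Fewest mismatches per group; ties go to the lowest start (arbitrary)."""
    return [min(g, key=lambda h: (h[2], h[0])) for g in group_overlaps(hits)]


# The nine hits from the report quoted in this thread.
hits = [(54, 63, 3), (55, 64, 0), (56, 65, 2),
        (104, 113, 1), (105, 114, 1), (106, 115, 3),
        (179, 188, 2), (190, 199, 3), (191, 200, 3)]
print(merged_regions(hits))
print(best_hits(hits))
```

On this data both functions return four entries, one per region, which matches the "4 hits rather than 9" reading of the report.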
This is a report output, so post-processing could be done by trimming the results before output using an associated qualifier. Still not sure how useful it would be, we need more feedback from other users on this one please! Peter Rice EMBOSS Team From aengus.stewart at cancer.org.uk Thu Oct 13 09:31:56 2011 From: aengus.stewart at cancer.org.uk (Aengus Stewart) Date: Thu, 13 Oct 2011 10:31:56 +0100 Subject: [EMBOSS] non-overlapping matches in fuzznuc? In-Reply-To: <4E96A4F1.4050303@ebi.ac.uk> References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk> <4E962A80.1070907@ebi.ac.uk> <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk> <4E96A4F1.4050303@ebi.ac.uk> Message-ID: <4E96B00C.80806@cancer.org.uk> So Peter is right about what I want returned - the best match, but of course has pointed out the problem with having 2 best matches for the same region ( in this example 104-113, 105-114). However, it is still the case that the "real" result is 4 hits rather than 9. I dont know if my example is a special case or not so it would be good as Peter suggests if someone else has used fuzznuc in a similar way. Though surely if you include any mismatch at all for your pattern search then you automatically have this scenario of returning multiple results for the same location? Cheers Aengus On 13/10/11 09:44, Peter Rice wrote: > On 13/10/2011 08:45, Jon Ison wrote: >> Hi chaps (Aengus !) >> >> If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for >> a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g. >> >> Start End Strand Pattern_name Mismatch Sequence >> 54 65 + pattern1 5 GCCAAATAAGGG >> 104 115 + pattern1 5 CCTAAATAAGGG >> 179 188 + pattern1 2 CCTTGCTTGG >> 190 200 + pattern1 6 CCGATTAGAGC >> >> Mismatch in this case is reporting the sum of mismatches from before. A column for number of >> (sub)matches would also be needed. Is that right Aengus? 
> > I'm not sure that adding the mismatches is sound. I'd assume just a best > hit from the overlapping matches. > >> The above might give a useful result depending in the input pattern. It would I think be easy >> enough to implement. > > This is a report output, so post-processing could be done by trimming > the results before output using an associated qualifier. > > Still not sure how useful it would be, we need more feedback from other > users on this one please! > > Peter Rice > EMBOSS Team > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss -- ----------------------------------------------------------------------- Aengus Stewart Tel: +44 (0)20 7269 3679 Head of Bioinformatics and BioStatistics CRUK London Research Institute Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK ----------------------------------------------------------------------- This electronic message contains information which may be privileged and confidential. The information is intended to be for the use of the individual(s) or entity named above. Be aware that any third party disclosure, distribution, copying or use of this communication, without prior permission, is strictly prohibited. NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you. Cancer Research UK Registered in England and Wales Company Registered Number: 4325234. 
Registered Charity Number: 1089464 and Scotland SC041666 Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD. From peter.r.hoyt at okstate.edu Thu Oct 20 16:29:47 2011 From: peter.r.hoyt at okstate.edu (peter.r.hoyt at okstate.edu) Date: Thu, 20 Oct 2011 11:29:47 -0500 Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5 Message-ID: <4EA04C7B.70008@okstate.edu> So I upgraded mEMBOSS (which I've been using for a while), to 6.4.0.4. In my previous installs, I had used CygWin, but this time, could NOT get CygWin install to work (I really tried!). So I settled for the Windows setup file. Now I have jEMBOSS running fine, but it still says version 1.5. Is this correct? The jEMBOSS version hasn't changed? My next question coming soon! Pete From ajb at ebi.ac.uk Mon Oct 24 12:56:34 2011 From: ajb at ebi.ac.uk (Alan Bleasby) Date: Mon, 24 Oct 2011 13:56:34 +0100 Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5 In-Reply-To: <4EA04C7B.70008@okstate.edu> References: <4EA04C7B.70008@okstate.edu> Message-ID: <4EA56082.1080307@ebi.ac.uk> Hello Pete, This one seems to have remained unanswered. Yes, the Jemboss version is still 1.5. The GUI has continued to be updated but the version number has remained the same for quite a while (an oversight on our part, thanks for highlighting it). Of course, to show the version of EMBOSS itself, you use the 'embossversion' application, which should show 6.4.0.4, within mEMBOSS, for the version you've installed. HTH Alan On 10/20/2011 05:29 PM, peter.r.hoyt at okstate.edu wrote: > So I upgraded mEMBOSS (which I've been using for a while), to 6.4.0.4. > In my previous installs, I had used CygWin, but this time, could NOT > get CygWin install to work (I really tried!). So I settled for the > Windows setup file. Now I have jEMBOSS running fine, but it still says > version 1.5. Is this correct? The jEMBOSS version hasn't changed? > > My next question coming soon! 
>
> Pete
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss

From bernd.web at gmail.com Fri Oct 28 17:03:15 2011
From: bernd.web at gmail.com (Bernd Web)
Date: Fri, 28 Oct 2011 19:03:15 +0200
Subject: [EMBOSS] fuzznuc pattern expansion
Message-ID:

Hi

Using fuzznuc I get illegal pattern warnings. I realize what is going on:

"You can use ambiguity codes for nucleic acid searches but not within [] or {} as they expand to bracketed counterparts. For example, "s" is expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is illegal."

However, what I cannot find is how to suppress this expansion. Is this possible? We actually need to have these ambiguity codes remain as they are within [] as the input sequences can contain R, Y, B, N themselves for example. Thus, [GCS] is a pattern we actually want to be able to use.

Kind regards,
Bernd

From pmr at ebi.ac.uk Sat Oct 29 17:06:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 29 Oct 2011 18:06:13 +0100
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To:
References:
Message-ID: <4EAC3285.7080501@ebi.ac.uk>

On 28/10/2011 18:03, Bernd Web wrote:
> Hi
>
> Using fuzznuc I get illegal pattern warnings. I realize what is going on:
>
> "You can use ambiguity codes for nucleic acid searches but not within
> [] or {} as they expand to bracketed counterparts. For example, "s" is
> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
> illegal."
>
> However, what I cannot find is how to suppress this expansion. Is this
> possible? We actually need to have these ambiguity codes remain as they are
> within [] as the input sequences can contain R, Y, B, N themselves for
> example. Thus, [GCS] is a pattern we actually want to be able to use.

That looks a reasonable suggestion. We can replace S with [GCS] directly.
For the wider ambiguity codes, we can replace them with the subsets:

B [TGCBSYK]
D [TGADWRK]
H [TCAHWYM]
V [GCAVSRM]

We can also allow 'C\S' to explicitly match CS in the input sequence by escaping the S to skip the automatic expansion.

These changes can be added to the next release.

Thanks for the idea.

Peter Rice
EMBOSS Team
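The subset rule proposed above can be written down compactly: a code expands to every IUPAC code whose bases all lie within its own. The following is only a minimal sketch in Python of that proposal, not fuzznuc's actual implementation. It assumes upper-case patterns; the names IUPAC, subset_codes and expand_pattern are hypothetical; and anything other than letters, [] classes and backslash escapes is passed through unchanged.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def subset_codes(code):
    """Return every code whose bases all fall within those of `code`."""
    if code not in IUPAC:
        return [code]  # unknown symbols are left alone
    bases = set(IUPAC[code])
    return sorted(c for c, b in IUPAC.items() if set(b) <= bases)


def expand_pattern(pattern):
    """Expand ambiguity codes into flat [] classes, honouring '\\' escapes."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped symbol: match it literally, with no expansion.
            out.append(pattern[i + 1])
            i += 2
        elif ch == "[":
            # Expand each member and merge, so no nested brackets arise.
            close = pattern.index("]", i)
            members = set()
            for c in pattern[i + 1:close]:
                members.update(subset_codes(c))
            out.append("[" + "".join(sorted(members)) + "]")
            i = close + 1
        elif ch in IUPAC and len(IUPAC[ch]) > 1:
            out.append("[" + "".join(subset_codes(ch)) + "]")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this table, subset_codes("B") yields the same seven symbols as the B row above (in sorted order), and expand_pattern("[GCS]") stays a single flat class instead of producing nested brackets.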