From georgios at biotek.uio.no  Sat Oct  1 05:37:50 2011
From: georgios at biotek.uio.no (Georgios Magklaras)
Date: Sat, 01 Oct 2011 11:37:50 +0200
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E86DF6E.6090505@biotek.uio.no>

On 09/30/2011 08:22 PM, Ed Siefker wrote:
> Is there a way to access NCBI Genbank remotely?
> My emboss.default contains the following:
>
> DB tgb [ type: N method: srswww format: genbank
>    url: "http://cbr-rbc.nrc-cnrc.gc.ca/srs6bin/cgi-bin/wgetz"
>    dbalias: genbankrelease
>    fields: "sv des org key"
>    comment: "Genbank IDs" ]
>
>
> However that server does not exist.  I've looked on
> the NCBI website for alternatives, but all I can find
> is the ftp site.  I've also read the EMBOSS admin guide.
> The examples there use infobiogen.fr, which is also
> closed.
>
> So what do people do for genbank access?  I'd prefer
> to avoid setting up a local database myself if I can.
> Is there a list of genbank mirrors around somewhere?
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
Hi Ed,

Yes, that SRS server no longer exists. The EBI SRS server is up and
updated regularly, but I am not sure whether it offers Genbank. It does
offer a full version of EMBL (a nucleotide database whose release
contents should mirror those of Genbank), so if you put the following
in your emboss.default file, you will be able to connect:

DB embl [ type: N method: srswww format: embl
    url: "http://srs.ebi.ac.uk/cgi-bin/wgetz"
    fields: "id sv des org key"
    comment: "EMBL" ]

Best regards,
GM

-- 
George Magklaras PhD
RHCE no: 805008309135525

Senior Systems Engineer/IT Manager
Biotek Center, University of Oslo
EMBnet TMPC Chair

http://folk.uio.no/georgios

Tel: +47 22840535


From hrh at fmi.ch  Sat Oct  1 07:15:41 2011
From: hrh at fmi.ch (Hans-Rudolf Hotz)
Date: Sat, 01 Oct 2011 13:15:41 +0200
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E86DF6E.6090505@biotek.uio.no>
References: <4E8608C8.3090502@creighton.edu> <4E86DF6E.6090505@biotek.uio.no>
Message-ID: <4E86F65D.1090803@fmi.ch>


On 10/01/2011 11:37 AM, Georgios Magklaras wrote:
> On 09/30/2011 08:22 PM, Ed Siefker wrote:
>> Is there a way to access NCBI Genbank remotely?
>> My emboss.default contains the following:
>>
>> DB tgb [ type: N method: srswww format: genbank
>> url: "http://cbr-rbc.nrc-cnrc.gc.ca/srs6bin/cgi-bin/wgetz"
>> dbalias: genbankrelease
>> fields: "sv des org key"
>> comment: "Genbank IDs" ]
>>
>>
>> However that server does not exist. I've looked on
>> the NCBI website for alternatives, but all I can find
>> is the ftp site. I've also read the EMBOSS admin guide.
>> The examples there use infobiogen.fr, which is also
>> closed.
>>
>> So what do people do for genbank access? I'd prefer
>> to avoid setting up a local database myself if I can.
>> Is there a list of genbank mirrors around somewhere?
>>
> Hi Ed,
>
> Yes, that SRS server does not exist anymore. The EBI SRS server is there
> and updated regurarly, but I am not sure if it offers Genbank. It does
> offer a full version of EMBL (nucleotide database, the contents of the
> release should mirror sync those of Genbank), so if you type the
> following in your emboss.default file, you will connect:
>

Try the SRS server at the 'DKFZ', see:

http://www.dkfz.de/menu/cgi-bin/srs7.1.3.1/wgetz?-page+databanks

or check the list of Public SRS Installations, see:

http://www.biowisdom.com/download/srs-parser-and-software-downloads/public-srs-installations/


(although, I am not sure whether this list is actually still maintained)


Regards, Hans


> DB embl [ type: N method: srswww format: embl
> url: "http://srs.ebi.ac.uk/cgi-bin/wgetz"
> fields: "id sv des org key"
> comment: "EMBL" ]
>
> Best regards,
> GM
>

From fermaral1981 at gmail.com  Tue Oct  4 09:38:22 2011
From: fermaral1981 at gmail.com (Fernando Martinez)
Date: Tue, 4 Oct 2011 15:38:22 +0200
Subject: [EMBOSS] uniq sequences on a list
Message-ID: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>

Hi, I am trying to retrieve sequences from a multi-fasta file where there are
identical sequences, and I want to extract only the ones in my list. How can
I do that?
Example:

Multi.fasta file:

>seq1
atataga...
>seq2
ttatggttca..
[...]
>seq1
atataga...
[...]

and my list is:

Multi.fasta:seq1
Multi.fasta:seq2

When I run "seqret @list -out out.fasta"

I retrieve :
>seq1
atataga...
>seq2
ttatggttca...
>seq1
atataga...

And I only want to take seq1 and seq2, not seq1 twice!

thanks

From pmr at ebi.ac.uk  Tue Oct  4 10:13:21 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 04 Oct 2011 15:13:21 +0100
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>
References: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>
Message-ID: <4E8B1481.9060305@ebi.ac.uk>

On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> Hi, I am trying to retrieve sequences from a multi-fasta file were there are
> identical sequences and i want to extract only the ones in my list, how can
> I do that?
> Example:
>
> Multi.fasta file:
>
>> seq1
> atataga...
>> seq2
> ttatggttca..
> [...]
>> seq1
> atataga...
> [...]
> And I only want to take seq1 an seq2, not two times seq1!!

If you really must start from that file ... as usual with EMBOSS, there
are several ways to do it:

1. Index with dbifasta
----------------------

You can index with the older dbifasta program. This does not allow 
duplicate IDs so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto

Then define a database in your .embossrc file:

DB multi [
   format: "fasta"
   method: "emblcd"
   type: "nucleotide"
   directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will 
have the sequences you want.


2. Rewrite as single files in a new directory, then rewrite as one file
-----------------------------------------------------------------------

% mkdir multi
% seqret -ossingle -osdir multi Multi.fasta -auto
% ls multi
seq1.fasta  seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(Note: you do need the quotes around the wildcard file name.)

This will give you a file Single.fasta in the original directory with
only the last version of each ID.


3. Write a new application
---------------------------

Another approach is to write your own new application. A copy of seqret 
which keeps a table of ids and rejects any sequence with known ID will 
rewrite the file (in any format) with only the first occurrence of each 
id. We will add this to the next release.
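
As a rough illustration of the id-table idea (a standalone Python sketch,
not the EMBOSS application mentioned above): the function name
unique_fasta is made up for this example, the input is assumed to be
plain FASTA, and the ID is assumed to be the first word after '>'.

```python
def unique_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each ID.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    """
    seen = set()   # table of IDs already written
    keep = False   # whether the current record is being kept
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in seen   # reject any record with a known ID
            seen.add(seq_id)
        if keep:
            yield line
```

It could be used by passing an open file, for example
out.writelines(unique_fasta(open("Multi.fasta"))), which keeps the first
seq1 and drops the later copy.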


4.  ... there may be more ways, but these will be enough to solve your 
problem.

Hope that helps,

Peter Rice
EMBOSS Team

From fermaral1981 at gmail.com  Wed Oct  5 06:52:43 2011
From: fermaral1981 at gmail.com (Fernando Martínez-Alberola)
Date: Wed, 05 Oct 2011 12:52:43 +0200
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To: <4E8B1481.9060305@ebi.ac.uk>
References: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>
	<4E8B1481.9060305@ebi.ac.uk>
Message-ID: <1317811963.14315.1016.camel@cladonia2-desktop>

On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote:
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-fasta file were there are
> > identical sequences and i want to extract only the ones in my list, how can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> >> seq1
> > atataga...
> >> seq2
> > ttatggttca..
> > [...]
> >> seq1
> > atataga...
> > [...]
> > And I only want to take seq1 an seq2, not two times seq1!!
> 
> If you really must start from that file .... as usual with EMBOSS there 
> are several ways to do it
> 
> 1. Index with dbifasta
> ----------------------
> 
> You can index with the older dbifasta program. This does not allow 
> duplicate IDs so only one seq1 will be indexed.
> 
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat 
> simple -auto
> 
> Then define a database in your .embossrc file:
> 
> DB multi [
>    format: "fasta"
>    method: "emblcd"
>    type: "nucleotide"
>    directory: "."
> ]
> 
> Then replace "Multi.fasta" in your listfile with "multi" and you will 
> have the sequences you want.
> 
> 
> 
> 2. rewrite as single files in a new directory, then rewrite as one file
> 
> % mkdir multi
> % seqret -ossingle -odsir multi Multi.fasta -auto
> % ls multi
> seq1.fasta  seq2.fasta ...
> 
> % cd multi
> seqret '*.fasta' ../Single.fasta
> 
> (note: you do need the quotes around the wild card file name)
> 
> this will give you a file Single.fasta in the original directory with 
> only the last version of each id.
> 
> 
> 
> 3. Write a new application
> ---------------------------
> 
> Another approach is to write your own new application. A copy of seqret 
> which keeps a table of ids and rejects any sequence with known ID will 
> rewrite the file (in any format) with only the first occurrence of each 
> id. We will add this to the next release.
> 
> 
> 4.  ... there may be more ways, but these will be enough to solve your 
> problem.
> 
> Hope that helps,
> 
> Peter Rice
> EMBOSS Team

Thanks, your help was very useful, in particular the second mode.
Best regards, Fernando


From pmr at ebi.ac.uk  Wed Oct  5 08:26:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Oct 2011 13:26:05 +0100
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E8C4CDD.9010500@ebi.ac.uk>

On 09/30/2011 07:22 PM, Ed Siefker wrote:
> Is there a way to access NCBI Genbank remotely?

The SRS server at DKFZ is defined as a server in EMBOSS 6.4.0.0 so you 
can use it with no extra definition:

seqret dkfz:genbank:x13776

You can also use query fields, for example:

seqret dkfz:genbank-id:x13776
seqret dkfz:genbank-acc:x13776
seqret 'dkfz:genbank-des:{amic & amir}'


The release should also support the NCBI Entrez server but there is a 
bug in parsing the header. I will add a fix to the next patch. Then you 
could also use entrez:nucleotide:x13776 which reads the genbank format 
of the entry.

Hope this helps,

Peter Rice
EMBOSS Team


From ajb at ebi.ac.uk  Wed Oct  5 11:05:30 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Oct 2011 16:05:30 +0100 (BST)
Subject: [EMBOSS] EMBOSS patch set 1-24 available. New mEMBOSS available.
Message-ID: <39217.82.26.12.214.1317827130.squirrel@imap04.ebi.ac.uk>

New bug-fix files are available for EMBOSS-6.4.0 and, for Windows
users, a new version of mEMBOSS is available.

The bugs fixed include those recently fixed (22-24), listed below,
and all those fixed by previous patches (1-21).

1) UNIX

As usual, the most convenient way of applying the bug-fixes is
to apply the patch file:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-24.gz

to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code,
then recompile and install.

(see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch
 for instructions on using 'patch').

Alternatively, you can individually copy the patched files
from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory
if your system does not support 'patch'.

2) mEMBOSS

The new version incorporates all new and previous bug-fixes.
Uninstall your previous mEMBOSS installation and download and install
the new setup file from:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.4-setup.exe


Alan

-----------------------------------------------------------------------

Fix 22. EMBOSS-6.4.0/emboss/diffseq.c
    	EMBOSS-6.4.0/ajax/core/ajreport.c

14-Sep-2011: Diffseq reports insertions in the second sequence with a
	     length 2 reversed region in the first sequence instead of
	     a length 0 empty sequence. This bug was introduced in
	     release 6.0.0 when reversed sequence features were updated.

Fix 23. EMBOSS-6.4.0/ajax/core/ajindex.c

04-Oct-2011: Dbx index files from earlier releases do not include a
	     type parameter to indicate an Identifier or Secondary
	     index. The code to test index field names failed to
	     define id and acc fields as Identifiers. This fix allows
	     old indexes to work with EMBOSS 6.4.0.

Fix 24. EMBOSS-6.4.0/ajax/core/ajfileio.c

05-Oct-2011: Trimming carriage controls from the ends of lines in a
	     buffer failed when MacOSX-style characters are used and
	     the line buffer is a reference counted string. An example
	     on non-MacOSX systems was processing the data returned by
	     the NCBI Entrez server.


From aengus.stewart at cancer.org.uk  Wed Oct 12 11:50:36 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 12 Oct 2011 16:50:36 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E95B74C.10107@cancer.org.uk>

Hi Folks,

I couldn't see a command line option to do what I wanted, i.e. return non-overlapping hits.

This is best explained with some sample output.

#=======================================
#
# Sequence: chr1_174353258_174354335     from: 1   to: 200
# HitCount: 9
#
# Pattern_name Mismatch Pattern
# pattern1            3 CC[AT](6)GG
#
# Complement: No
#
#=======================================

   Start     End  Strand Pattern_name Mismatch Sequence
      54      63       + pattern1            3 GCCAAATAAG
      55      64       + pattern1            . CCAAATAAGG
      56      65       + pattern1            2 CAAATAAGGG
     104     113       + pattern1            1 CCTAAATAAG
     105     114       + pattern1            1 CTAAATAAGG
     106     115       + pattern1            3 TAAATAAGGG
     179     188       + pattern1            2 CCTTGCTTGG
     190     199       + pattern1            3 CCGATTAGAG
     191     200       + pattern1            3 CGATTAGAGC

As you can see this is actually only 4 hits rather than the 9 reported.

I can do this myself with another script but I was wondering if it could be an option?


regards
Aengus

-- 
-----------------------------------------------------------------------
Aengus Stewart                                 Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential.  The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. 

We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you. 
Cancer Research UK
Registered in England and Wales
Company Registered Number: 4325234.
Registered Charity Number: 1089464 and Scotland SC041666
Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.

From pmr at ebi.ac.uk  Wed Oct 12 20:02:08 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 01:02:08 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E95B74C.10107@cancer.org.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
Message-ID: <4E962A80.1070907@ebi.ac.uk>

On 12/10/2011 16:50, Aengus Stewart wrote:
> Hi Folks,
>
> I couldnt see a command line option to do what I wanted ie return
> non-overlapping hits.
>
> This is best explained with some sample output.
>
> #=======================================
> #
> # Sequence: chr1_174353258_174354335 from: 1 to: 200
> # HitCount: 9
> #
> # Pattern_name Mismatch Pattern
> # pattern1 3 CC[AT](6)GG
>
> As you can see this is actually only 4 hits rather than the 9 reported.

Hmmm ... with that kind of pattern and 3 mismatches there are pretty 
sure to be overlapping matches.

Trouble is, which matches would you want to keep? Your second match, for
example, has 2 hits with 1 mismatch at 104..113 and 105..114.

It should be possible to come up with patterns where the choice of 'best 
hit' complicates which hits are considered to overlap.

Probably writing a script is your best bet as you can then control which 
hits are picked.
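
A minimal sketch of such a script, in Python, under stated assumptions:
hits are (start, end, mismatches) tuples already parsed from the report
for one pattern and one strand, overlapping hits are grouped into
clusters, and the hit with the fewest mismatches is kept. Ties go to the
leftmost hit, which is only one possible rule. The name best_hits is
invented for this example.

```python
def best_hits(hits):
    """Keep one hit per cluster of overlapping hits.

    hits: iterable of (start, end, mismatches) tuples.
    The kept hit is the one with the fewest mismatches; on a tie the
    leftmost hit wins (an arbitrary choice made for this sketch).
    """
    clusters = []
    for hit in sorted(hits):
        start, end, mismatches = hit
        if clusters and start <= clusters[-1]["end"]:
            cluster = clusters[-1]
            cluster["end"] = max(cluster["end"], end)
            if mismatches < cluster["best"][2]:
                cluster["best"] = hit
        else:
            clusters.append({"end": end, "best": hit})
    return [cluster["best"] for cluster in clusters]
```

Applied to the nine hits in the sample output, this keeps four of them:
55..64, 104..113, 179..188 and 190..199.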

We could try to write an application to remove overlapping features ... 
if someone can define how to select them. In this case, the mismatch 
number will be stored as a tag (feature qualifier) in the feature table 
and could be included in the selection criteria.

Hope this helps ... and maybe sparks some ideas

Peter Rice
EMBOSS Team

From jison at ebi.ac.uk  Thu Oct 13 03:45:58 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 13 Oct 2011 08:45:58 +0100 (BST)
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E962A80.1070907@ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
	<4E962A80.1070907@ebi.ac.uk>
Message-ID: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>

Hi chaps (Aengus !)

If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for
a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g.

   Start     End  Strand Pattern_name Mismatch Sequence
      54      65       + pattern1            5 GCCAAATAAGGG
     104     115       + pattern1            5 CCTAAATAAGGG
     179     188       + pattern1            2 CCTTGCTTGG
     190     200       + pattern1            6 CCGATTAGAGC

Mismatch in this case is reporting the sum of mismatches from before.  A column for number of
(sub)matches would also be needed.  Is that right Aengus?

The above might give a useful result depending on the input pattern.  It would I think be easy
enough to implement.
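
The merging described above can be sketched in a few lines of Python.
This is an illustration only, not fuzznuc code: the function name
merge_hits is made up, hits are assumed to be (start, end, mismatches)
tuples for a single pattern and strand, and the sub-match count is
returned as a fourth value.

```python
def merge_hits(hits):
    """Merge overlapping hits into non-overlapping regions.

    hits: iterable of (start, end, mismatches) tuples.
    Returns a list of (start, end, summed_mismatches, n_hits) tuples.
    """
    regions = []
    for start, end, mismatches in sorted(hits):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it and add to the totals.
            r_start, r_end, r_mis, r_count = regions[-1]
            regions[-1] = (r_start, max(r_end, end), r_mis + mismatches, r_count + 1)
        else:
            regions.append((start, end, mismatches, 1))
    return regions
```

For the nine hits in Aengus' report this gives the four regions in the
table above, with 3, 3, 1 and 2 sub-matches respectively.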

Cheers

Jon


> On 12/10/2011 16:50, Aengus Stewart wrote:
>> Hi Folks,
>>
>> I couldnt see a command line option to do what I wanted ie return
>> non-overlapping hits.
>>
>> This is best explained with some sample output.
>>
>> #=======================================
>> #
>> # Sequence: chr1_174353258_174354335 from: 1 to: 200
>> # HitCount: 9
>> #
>> # Pattern_name Mismatch Pattern
>> # pattern1 3 CC[AT](6)GG
>>
>> As you can see this is actually only 4 hits rather than the 9 reported.
>
> Hmmm ... with that kind of pattern and 3 mismatches there are pretty
> sure to be overlapping matches.
>
> Trouble is, which matches would you want to keep? Your second match, for
> example, has 2 hits with 1 mismatch at 104..115 and 105..116
>
> It should be possible to come up with patterns where the choice of 'best
> hit' complicates which hits are considered to overlap.
>
> Probably writing a script is your best bet as you can then control which
> hits are picked.
>
> We could try to write an application to remove overlapping features ...
> if someone can define how to select them. In this case, the mismatch
> number will be stored as a tag (feature qualifier) in the feature table
> and could be included in the selection criteria.
>
> Hope this helps ... and maybe sparks some ideas
>
> Peter Rice
> EMBOSS Team


From pmr at ebi.ac.uk  Thu Oct 13 04:44:33 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 09:44:33 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
	<4E962A80.1070907@ebi.ac.uk>
	<45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
Message-ID: <4E96A4F1.4050303@ebi.ac.uk>

On 13/10/2011 08:45, Jon Ison wrote:
> Hi chaps (Aengus !)
>
> If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for
> a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g.
>
>     Start     End  Strand Pattern_name Mismatch Sequence
>        54      65       + pattern1            5 GCCAAATAAGGG
>       104     115       + pattern1            5 CCTAAATAAGGG
>       179     188       + pattern1            2 CCTTGCTTGG
>       190     200       + pattern1            6 CCGATTAGAGC
>
> Mismatch in this case is reporting the sum of mismatches from before.  A column for number of
> (sub)matches would also be needed.  Is that right Aengus?

I'm not sure that adding the mismatches is sound. I'd assume just a best 
hit from the overlapping matches.

> The above might give a useful result depending in the input pattern.  It would I think be easy
> enough to implement.

This is a report output, so post-processing could be done by trimming 
the results before output using an associated qualifier.

Still not sure how useful it would be, we need more feedback from other 
users on this one please!

Peter Rice
EMBOSS Team


From aengus.stewart at cancer.org.uk  Thu Oct 13 05:31:56 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Thu, 13 Oct 2011 10:31:56 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E96A4F1.4050303@ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
	<4E962A80.1070907@ebi.ac.uk>
	<45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
	<4E96A4F1.4050303@ebi.ac.uk>
Message-ID: <4E96B00C.80806@cancer.org.uk>


So Peter is right about what I want returned: the best match. Of course, he has also pointed out the problem of having 2 best matches for the same region (in this example 104-113 and 105-114). However, it is still the case that the "real" result is 4 hits rather than 9.

I don't know whether my example is a special case, so it would be good, as Peter suggests, to hear from anyone else who has used fuzznuc in a similar way. Though surely if you allow any mismatch at all in your pattern search, then you automatically have this scenario of multiple results being returned for the same location?


Cheers
Aengus


On 13/10/11 09:44, Peter Rice wrote:
> On 13/10/2011 08:45, Jon Ison wrote:
>> Hi chaps (Aengus !)
>>
>> If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for
>> a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g.
>>
>>      Start     End  Strand Pattern_name Mismatch Sequence
>>         54      65       + pattern1            5 GCCAAATAAGGG
>>        104     115       + pattern1            5 CCTAAATAAGGG
>>        179     188       + pattern1            2 CCTTGCTTGG
>>        190     200       + pattern1            6 CCGATTAGAGC
>>
>> Mismatch in this case is reporting the sum of mismatches from before.  A column for number of
>> (sub)matches would also be needed.  Is that right Aengus?
>
> I'm not sure that adding the mismatches is sound. I'd assume just a best
> hit from the overlapping matches.
>
>> The above might give a useful result depending in the input pattern.  It would I think be easy
>> enough to implement.
>
> This is a report output, so post-processing could be done by trimming
> the results before output using an associated qualifier.
>
> Still not sure how useful it would be, we need more feedback from other
> users on this one please!
>
> Peter Rice
> EMBOSS Team
>


-- 
-----------------------------------------------------------------------
Aengus Stewart                                 Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------


From peter.r.hoyt at okstate.edu  Thu Oct 20 12:29:47 2011
From: peter.r.hoyt at okstate.edu (peter.r.hoyt at okstate.edu)
Date: Thu, 20 Oct 2011 11:29:47 -0500
Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5
Message-ID: <4EA04C7B.70008@okstate.edu>

So I upgraded mEMBOSS (which I've been using for a while), to 6.4.0.4. 
In my previous installs, I had used CygWin, but this time, could NOT get 
CygWin install to work (I really tried!). So I settled for the Windows 
setup file. Now I have jEMBOSS running fine, but it still says version 
1.5. Is this correct? The jEMBOSS version hasn't changed?

My next question coming soon!

Pete

From ajb at ebi.ac.uk  Mon Oct 24 08:56:34 2011
From: ajb at ebi.ac.uk (Alan Bleasby)
Date: Mon, 24 Oct 2011 13:56:34 +0100
Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5
In-Reply-To: <4EA04C7B.70008@okstate.edu>
References: <4EA04C7B.70008@okstate.edu>
Message-ID: <4EA56082.1080307@ebi.ac.uk>

Hello Pete,

This one seems to have remained unanswered.

Yes, the Jemboss version is still 1.5. The GUI has continued to be
updated but the version number has remained the same for quite a while
(an oversight on our part, thanks for highlighting it).

Of course, to show the version of EMBOSS itself, you use the 'embossversion'
application, which should show 6.4.0.4, within mEMBOSS, for the version 
you've installed.

HTH

Alan


On 10/20/2011 05:29 PM, peter.r.hoyt at okstate.edu wrote:
> So I upgraded mEMBOSS (which I've been using for a while), to 6.4.0.4. 
> In my previous installs, I had used CygWin, but this time, could NOT 
> get CygWin install to work (I really tried!). So I settled for the 
> Windows setup file. Now I have jEMBOSS running fine, but it still says 
> version 1.5. Is this correct? The jEMBOSS version hasn't changed?
>
> My next question coming soon!
>
> Pete


From bernd.web at gmail.com  Fri Oct 28 13:03:15 2011
From: bernd.web at gmail.com (Bernd Web)
Date: Fri, 28 Oct 2011 19:03:15 +0200
Subject: [EMBOSS] fuzznuc pattern expansion
Message-ID: <CAExAtoDc1KrgnRy=_haJqOG-+STgDxiPQFhE5HkucmEjmQRmQA@mail.gmail.com>

Hi

Using fuzznuc I get illegal pattern warnings. I realize what is going on:

"You can use ambiguity codes for nucleic acid searches but not within
[] or {} as they expand to bracketed counterparts. For example, "s" is
expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
illegal."

However, what I cannot find is how to suppress this expansion. Is this
possible? We actually need these ambiguity codes to remain as they are
within [], as the input sequences can themselves contain R, Y, B, N, for
example. Thus, [GCS] is a pattern we actually want to be able to use.


Kind regards,
Bernd

From pmr at ebi.ac.uk  Sat Oct 29 13:06:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 29 Oct 2011 18:06:13 +0100
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To: <CAExAtoDc1KrgnRy=_haJqOG-+STgDxiPQFhE5HkucmEjmQRmQA@mail.gmail.com>
References: <CAExAtoDc1KrgnRy=_haJqOG-+STgDxiPQFhE5HkucmEjmQRmQA@mail.gmail.com>
Message-ID: <4EAC3285.7080501@ebi.ac.uk>

On 28/10/2011 18:03, Bernd Web wrote:
> Hi
>
> Using fuzznuc I get illegal pattern warnings. I realize what is going on:
>
> "You can use ambiguity codes for nucleic acid searches but not within
> [] or {} as they expand to bracketed counterparts. For example, "s" is
> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
> illegal."
>
> However, what I cannot find it how to suppress this expansion. Is this
> possible? We actually need to have these ambiguity remain as they are
> within [] as the input sequences can contain R, Y, B, N themselves for
> example. Thus, [GCS] is a pattern we actually want to be able to use.

That looks a reasonable suggestion.

We can replace S with [GCS] directly. For the wider ambiguity codes, we 
can replace them with the subsets:

B [TGCBSYK]
D [TGADWRK]
H [TCAHWYM]
V [GCAVSRM]

We can also allow 'C\S' to explicitly match CS in the input sequence by 
escaping the S to skip the automatic expansion.
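
The subset rule behind the table above can be sketched in Python. This
is an illustration of the proposal, not the fuzznuc implementation: the
names IUPAC, subset_codes and expand_pattern are invented here, the
base sets are the standard IUPAC nucleotide codes, and the backslash
handling is an assumption about how the escape might behave.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def subset_codes(code):
    """Return every code whose base set is contained in that of `code`."""
    bases = set(IUPAC[code.upper()])
    return "".join(sorted(c for c, b in IUPAC.items() if set(b) <= bases))


def expand_pattern(pattern):
    """Expand ambiguity codes outside [] and {}.

    A backslash keeps the following character literal (assumed behaviour).
    """
    out = []
    depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i + 1])   # escaped: no expansion
            i += 2
            continue
        if ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        if depth == 0 and ch.upper() in IUPAC and len(IUPAC[ch.upper()]) > 1:
            out.append("[" + subset_codes(ch) + "]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

With these definitions subset_codes("B") contains the same seven codes
as the B row above (in sorted order), and a pattern such as [GCS] passes
through unchanged because nothing inside brackets is expanded.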

These changes can be added to the next release.

Thanks for the idea.

Peter Rice
EMBOSS Team




> DB embl [ type: N method: srswww format: embl
> url: "http://srs.ebi.ac.uk/cgi-bin/wgetz"
> fields: "id sv des org key"
> comment: "EMBL" ]
>
> Best regards,
> GM
>


From fermaral1981 at gmail.com  Tue Oct  4 13:38:22 2011
From: fermaral1981 at gmail.com (Fernando Martinez)
Date: Tue, 4 Oct 2011 15:38:22 +0200
Subject: [EMBOSS] uniq sequences on a list
Message-ID: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>

Hi, I am trying to retrieve sequences from a multi-fasta file where there are
identical sequences, and I want to extract only the ones in my list. How can
I do that?
Example:

Multi.fasta file:

>seq1
atataga...
>seq2
ttatggttca..
[...]
>seq1
atataga...
[...]

and my list is:

Multi.fasta:seq1
Multi.fasta:seq2

When I run "seqret @list -out out.fasta"

I retrieve :
>seq1
atataga...
>seq2
ttatggttca...
>seq1
atataga...

And I only want to take seq1 and seq2, not seq1 twice!

thanks


From pmr at ebi.ac.uk  Tue Oct  4 14:13:21 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 04 Oct 2011 15:13:21 +0100
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>
References: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>
Message-ID: <4E8B1481.9060305@ebi.ac.uk>

On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> Hi, I am trying to retrieve sequences from a multi-fasta file were there are
> identical sequences and i want to extract only the ones in my list, how can
> I do that?
> Example:
>
> Multi.fasta file:
>
>> seq1
> atataga...
>> seq2
> ttatggttca..
> [...]
>> seq1
> atataga...
> [...]
> And I only want to take seq1 an seq2, not two times seq1!!

If you really must start from that file .... as usual with EMBOSS there 
are several ways to do it

1. Index with dbifasta
----------------------

You can index with the older dbifasta program. This does not allow 
duplicate IDs so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto

Then define a database in your .embossrc file:

DB multi [
   format: "fasta"
   method: "emblcd"
   type: "nucleotide"
   directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will 
have the sequences you want.


2. Rewrite as single files in a new directory, then rewrite as one file
-----------------------------------------------------------------------

% mkdir multi
% seqret -ossingle -osdirectory multi Multi.fasta -auto
% ls multi
seq1.fasta  seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(note: you do need the quotes around the wild card file name)

this will give you a file Single.fasta in the original directory with 
only the last version of each id.


3. Write a new application
---------------------------

Another approach is to write your own new application. A copy of seqret 
which keeps a table of ids and rejects any sequence with known ID will 
rewrite the file (in any format) with only the first occurrence of each 
id. We will add this to the next release.


4.  ... there may be more ways, but these will be enough to solve your 
problem.
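
As a rough illustration of approach 3 (not the planned EMBOSS application),
here is a minimal Python sketch that keeps a table of IDs and drops any
record whose ID has already been seen. The function name dedupe_fasta and
the keep parameter are made up for this example, and it assumes the ID is
the first word after '>'. With keep="last" it mimics the outcome of
approach 2, where the last version of each ID survives.

```python
def dedupe_fasta(text, keep="first"):
    """Return FASTA text with exactly one record per ID.

    keep="first" keeps the first record seen for each ID,
    keep="last" keeps the last one (hypothetical helper, not part of EMBOSS).
    """
    records = {}      # ID -> list of lines; dicts preserve insertion order
    current = None    # ID of the record being collected, None while skipping
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if seq_id in records and keep == "first":
                current = None              # duplicate ID: skip this record
            else:
                records[seq_id] = [line]    # new ID, or overwrite for "last"
                current = seq_id
        elif current is not None:
            records[current].append(line)
    out = ["\n".join(lines) for lines in records.values()]
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    example = ">seq1\natataga\n>seq2\nttatggttca\n>seq1\natataga\n"
    print(dedupe_fasta(example), end="")
```

Reading the whole file into memory keeps the sketch short; for very large
files a streaming version that only remembers the set of IDs would be the
more sensible choice.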

Hope that helps,

Peter Rice
EMBOSS Team


From fermaral1981 at gmail.com  Wed Oct  5 10:52:43 2011
From: fermaral1981 at gmail.com (Fernando Martínez-Alberola)
Date: Wed, 05 Oct 2011 12:52:43 +0200
Subject: [EMBOSS] uniq sequences on a list
In-Reply-To: <4E8B1481.9060305@ebi.ac.uk>
References: <CAPuaYk8BOE7zhP7fAJFiW-_E_Oub11s51S6YTp=6qH5LZNObqw@mail.gmail.com>
	<4E8B1481.9060305@ebi.ac.uk>
Message-ID: <1317811963.14315.1016.camel@cladonia2-desktop>

On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote: 
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-fasta file were there are
> > identical sequences and i want to extract only the ones in my list, how can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> >> seq1
> > atataga...
> >> seq2
> > ttatggttca..
> > [...]
> >> seq1
> > atataga...
> > [...]
> > And I only want to take seq1 an seq2, not two times seq1!!
> 
> If you really must start from that file .... as usual with EMBOSS there 
> are several ways to do it
> 
> 1. Index with dbifasta
> ----------------------
> 
> You can index with the older dbifasta program. This does not allow 
> duplicate IDs so only one seq1 will be indexed.
> 
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat 
> simple -auto
> 
> Then define a database in your .embossrc file:
> 
> DB multi [
>    format: "fasta"
>    method: "emblcd"
>    type: "nucleotide"
>    directory: "."
> ]
> 
> Then replace "Multi.fasta" in your listfile with "multi" and you will 
> have the sequences you want.
> 
> 
> 
> 2. rewrite as single files in a new directory, then rewrite as one file
> 
> % mkdir multi
> % seqret -ossingle -odsir multi Multi.fasta -auto
> % ls multi
> seq1.fasta  seq2.fasta ...
> 
> % cd multi
> seqret '*.fasta' ../Single.fasta
> 
> (note: you do need the quotes around the wild card file name)
> 
> this will give you a file Single.fasta in the original directory with 
> only the last version of each id.
> 
> 
> 
> 3. Write a new application
> ---------------------------
> 
> Another approach is to write your own new application. A copy of seqret 
> which keeps a table of ids and rejects any sequence with known ID will 
> rewrite the file (in any format) with only the first occurrence of each 
> id. We will add this to the next release.
> 
> 
> 4.  ... there may be more ways, but these will be enough to solve your 
> problem.
> 
> Hope that helps,
> 
> Peter Rice
> EMBOSS Team

Thanks, your help was very useful, in particular the second mode.
Best regards, Fernando


From pmr at ebi.ac.uk  Wed Oct  5 12:26:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Oct 2011 13:26:05 +0100
Subject: [EMBOSS] Remote Genbank from NCBI?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E8C4CDD.9010500@ebi.ac.uk>

On 09/30/2011 07:22 PM, Ed Siefker wrote:
> Is there a way to access NCBI Genbank remotely?

The SRS server at DKFZ is defined as a server in EMBOSS 6.4.0.0 so you 
can use it with no extra definition:

seqret dkfz:genbank:x13776

You can also use query fields, for example:

seqret dkfz:genbank-id:x13776
seqret dkfz:genbank-acc:x13776
seqret 'dkfz:genbank-des:{amic & amir}'


The release should also support the NCBI Entrez server but there is a 
bug in parsing the header. I will add a fix to the next patch. Then you 
could also use entrez:nucleotide:x13776 which reads the genbank format 
of the entry.

Hope this helps,

Peter Rice
EMBOSS Team


From ajb at ebi.ac.uk  Wed Oct  5 15:05:30 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Oct 2011 16:05:30 +0100 (BST)
Subject: [EMBOSS] EMBOSS patch set 1-24 available. New mEMBOSS available.
Message-ID: <39217.82.26.12.214.1317827130.squirrel@imap04.ebi.ac.uk>

New bug-fix files are available for EMBOSS-6.4.0 and, for Windows
users, a new version of mEMBOSS is available.

The bugs fixed include those recently fixed (22-24), listed below,
and all those fixed by previous patches (1-21).

1) UNIX

As usual, the most convenient way of applying the bug-fixes is
to apply the patch file:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-24.gz

to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code
and recompiling/installing.

(see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch
 for instructions on using 'patch').

Alternatively, you can individually copy the patched files
from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory
if your system does not support 'patch'.

2) mEMBOSS

The new version incorporates all new and previous bug-fixes.
Uninstall your previous mEMBOSS installation and download and install
the new setup file from:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.4-setup.exe


Alan

-----------------------------------------------------------------------

Fix 22. EMBOSS-6.4.0/emboss/diffseq.c
    	EMBOSS-6.4.0/ajax/core/ajreport.c

14-Sep-2011: Diffseq reports insertions in the second sequence with a
	     length 2 reversed region in the first sequence instead of
	     a length 0 empty sequence. This bug was introduced in
	     release 6.0.0 when reversed sequence features were updated.

Fix 23. EMBOSS-6.4.0/ajax/core/ajindex.c

04-Oct-2011: Dbx index files from earlier releases do not include a
	     type parameter to indicate an Identifier or Secondary
	     index. The code to test index field names failed to
	     define id and acc fields as Identifiers. This fix allows
	     old indexes to work with EMBOSS 6.4.0.

Fix 24. EMBOSS-6.4.0/ajax/core/ajfileio.c

05-Oct-2011: Trimming carriage controls from the ends of lines in a
	     buffer failed when MacOSX-style characters are used and
	     the line buffer is a reference counted string. An example
	     on non-MacOSX systems was processing the data returned by
	     the NCBI Entrez server.


From aengus.stewart at cancer.org.uk  Wed Oct 12 15:50:36 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 12 Oct 2011 16:50:36 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E8608C8.3090502@creighton.edu>
References: <4E8608C8.3090502@creighton.edu>
Message-ID: <4E95B74C.10107@cancer.org.uk>

Hi Folks,

I couldn't see a command-line option to do what I wanted, i.e. return non-overlapping hits.

This is best explained with some sample output.

#=======================================
#
# Sequence: chr1_174353258_174354335     from: 1   to: 200
# HitCount: 9
#
# Pattern_name Mismatch Pattern
# pattern1            3 CC[AT](6)GG
#
# Complement: No
#
#=======================================

   Start     End  Strand Pattern_name Mismatch Sequence
      54      63       + pattern1            3 GCCAAATAAG
      55      64       + pattern1            . CCAAATAAGG
      56      65       + pattern1            2 CAAATAAGGG
     104     113       + pattern1            1 CCTAAATAAG
     105     114       + pattern1            1 CTAAATAAGG
     106     115       + pattern1            3 TAAATAAGGG
     179     188       + pattern1            2 CCTTGCTTGG
     190     199       + pattern1            3 CCGATTAGAG
     191     200       + pattern1            3 CGATTAGAGC

As you can see this is actually only 4 hits rather than the 9 reported.

I can do this myself with another script but I was wondering if it could be an option?


regards
Aengus

-- 
-----------------------------------------------------------------------
Aengus Stewart                                 Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential.  The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. 

We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you. 
Cancer Research UK
Registered in England and Wales
Company Registered Number: 4325234.
Registered Charity Number: 1089464 and Scotland SC041666
Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.


From pmr at ebi.ac.uk  Thu Oct 13 00:02:08 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 01:02:08 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E95B74C.10107@cancer.org.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
Message-ID: <4E962A80.1070907@ebi.ac.uk>

On 12/10/2011 16:50, Aengus Stewart wrote:
> Hi Folks,
>
> I couldnt see a command line option to do what I wanted ie return
> non-overlapping hits.
>
> This is best explained with some sample output.
>
> #=======================================
> #
> # Sequence: chr1_174353258_174354335 from: 1 to: 200
> # HitCount: 9
> #
> # Pattern_name Mismatch Pattern
> # pattern1 3 CC[AT](6)GG
>
> As you can see this is actually only 4 hits rather than the 9 reported.

Hmmm ... with that kind of pattern and 3 mismatches there are pretty 
sure to be overlapping matches.

Trouble is, which matches would you want to keep? Your second match, for 
example, has 2 hits with 1 mismatch at 104..113 and 105..114

It should be possible to come up with patterns where the choice of 'best 
hit' complicates which hits are considered to overlap.

Probably writing a script is your best bet as you can then control which 
hits are picked.

We could try to write an application to remove overlapping features ... 
if someone can define how to select them. In this case, the mismatch 
number will be stored as a tag (feature qualifier) in the feature table 
and could be included in the selection criteria.
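
A minimal Python sketch of such a script, under stated assumptions: hits
are (start, end, mismatches) tuples typed in from the fuzznuc report quoted
above (with '.' read as 0 mismatches), hits that overlap directly or
through a chain of overlaps form one group, and the "best" hit in a group
is the one with the fewest mismatches, taking the leftmost on a tie. The
tie-break is an arbitrary choice and the function names are invented for
this example; nothing here is part of EMBOSS.

```python
def cluster_hits(hits):
    """Group (start, end, mismatches) hits into lists of overlapping hits."""
    clusters = []
    cluster_end = None          # rightmost end position of the current group
    for hit in sorted(hits):
        start, end, _ = hit
        if clusters and start <= cluster_end:
            clusters[-1].append(hit)            # overlaps the current group
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([hit])              # start a new group
            cluster_end = end
    return clusters


def best_hits(hits):
    """Keep one hit per group: fewest mismatches, leftmost on a tie."""
    return [min(group, key=lambda h: (h[2], h[0]))
            for group in cluster_hits(hits)]


if __name__ == "__main__":
    report = [(54, 63, 3), (55, 64, 0), (56, 65, 2),
              (104, 113, 1), (105, 114, 1), (106, 115, 3),
              (179, 188, 2), (190, 199, 3), (191, 200, 3)]
    for start, end, mismatches in best_hits(report):
        print(start, end, mismatches)
```

Changing the key function in best_hits is the place to encode a different
selection rule, which is exactly the part that still needs defining.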

Hope this helps ... and maybe sparks some ideas

Peter Rice
EMBOSS Team


From jison at ebi.ac.uk  Thu Oct 13 07:45:58 2011
From: jison at ebi.ac.uk (Jon Ison)
Date: Thu, 13 Oct 2011 08:45:58 +0100 (BST)
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E962A80.1070907@ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
	<4E962A80.1070907@ebi.ac.uk>
Message-ID: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>

Hi chaps (Aengus !)

If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for
a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g.

   Start     End  Strand Pattern_name Mismatch Sequence
      54      65       + pattern1            5 GCCAAATAAGGG
     104     115       + pattern1            5 CCTAAATAAGGG
     179     188       + pattern1            2 CCTTGCTTGG
     190     200       + pattern1            6 CCGATTAGAGC

Mismatch in this case is reporting the sum of mismatches from before.  A column for number of
(sub)matches would also be needed.  Is that right Aengus?

The above might give a useful result depending on the input pattern.  It would I think be easy
enough to implement.
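
To make the merging rule concrete, here is a small Python sketch (an
illustration only, not a proposed fuzznuc implementation). It assumes hits
are (start, end, mismatches) tuples taken from the report, treats two hits
as overlapping when the next start is not beyond the current region end,
and returns the summed mismatches plus the number of (sub)matches per
region. The name merge_hits is invented for this example.

```python
def merge_hits(hits):
    """Merge overlapping (start, end, mismatches) hits into regions.

    Returns a list of (start, end, summed_mismatches, n_hits) tuples.
    """
    regions = []
    for start, end, mismatches in sorted(hits):
        if regions and start <= regions[-1][1]:
            region = regions[-1]
            region[1] = max(region[1], end)   # extend the region
            region[2] += mismatches           # sum of mismatches
            region[3] += 1                    # count of merged hits
        else:
            regions.append([start, end, mismatches, 1])
    return [tuple(region) for region in regions]


if __name__ == "__main__":
    report = [(54, 63, 3), (55, 64, 0), (56, 65, 2),
              (104, 113, 1), (105, 114, 1), (106, 115, 3),
              (179, 188, 2), (190, 199, 3), (191, 200, 3)]
    for region in merge_hits(report):
        print(*region)
```

Run on the nine hits from the original report, this gives the four regions
in the table above, with 3, 3, 1 and 2 (sub)matches respectively.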

Cheers

Jon


> On 12/10/2011 16:50, Aengus Stewart wrote:
>> Hi Folks,
>>
>> I couldnt see a command line option to do what I wanted ie return
>> non-overlapping hits.
>>
>> This is best explained with some sample output.
>>
>> #=======================================
>> #
>> # Sequence: chr1_174353258_174354335 from: 1 to: 200
>> # HitCount: 9
>> #
>> # Pattern_name Mismatch Pattern
>> # pattern1 3 CC[AT](6)GG
>>
>> As you can see this is actually only 4 hits rather than the 9 reported.
>
> Hmmm ... with that kind of pattern and 3 mismatches there are pretty
> sure to be overlapping matches.
>
> Trouble is, which matches would you want to keep? Your second match, for
> example, has 2 hits with 1 mismatch at 104..115 and 105..116
>
> It should be possible to come up with patterns where the choice of 'best
> hit' complicates which hits are considered to overlap.
>
> Probably writing a script is your best bet as you can then control which
> hits are picked.
>
> We could try to write an application to remove overlapping features ...
> if someone can define how to select them. In this case, the mismatch
> number will be stored as a tag (feature qualifier) in the feature table
> and could be included in the selection criteria.
>
> Hope this helps ... and maybe sparks some ideas
>
> Peter Rice
> EMBOSS Team
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>


From pmr at ebi.ac.uk  Thu Oct 13 08:44:33 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 13 Oct 2011 09:44:33 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
	<4E962A80.1070907@ebi.ac.uk>
	<45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
Message-ID: <4E96A4F1.4050303@ebi.ac.uk>

On 13/10/2011 08:45, Jon Ison wrote:
> Hi chaps (Aengus !)
>
> If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for
> a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g.
>
>     Start     End  Strand Pattern_name Mismatch Sequence
>        54      65       + pattern1            5 GCCAAATAAGGG
>       104     115       + pattern1            5 CCTAAATAAGGG
>       179     188       + pattern1            2 CCTTGCTTGG
>       190     200       + pattern1            6 CCGATTAGAGC
>
> Mismatch in this case is reporting the sum of mismatches from before.  A column for number of
> (sub)matches would also be needed.  Is that right Aengus?

I'm not sure that adding the mismatches is sound. I'd assume just a best 
hit from the overlapping matches.

> The above might give a useful result depending in the input pattern.  It would I think be easy
> enough to implement.

This is a report output, so post-processing could be done by trimming 
the results before output using an associated qualifier.

Still not sure how useful it would be, we need more feedback from other 
users on this one please!

Peter Rice
EMBOSS Team


From aengus.stewart at cancer.org.uk  Thu Oct 13 09:31:56 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Thu, 13 Oct 2011 10:31:56 +0100
Subject: [EMBOSS] non-overlapping matches in fuzznuc?
In-Reply-To: <4E96A4F1.4050303@ebi.ac.uk>
References: <4E8608C8.3090502@creighton.edu> <4E95B74C.10107@cancer.org.uk>
	<4E962A80.1070907@ebi.ac.uk>
	<45540.84.92.187.247.1318491958.squirrel@webmail.ebi.ac.uk>
	<4E96A4F1.4050303@ebi.ac.uk>
Message-ID: <4E96B00C.80806@cancer.org.uk>


So Peter is right about what I want returned: the best match. He has of course pointed out the problem of having 2 best matches for the same region (in this example 104-113 and 105-114).  However, it is still the case that the "real" result is 4 hits rather than 9.

I don't know if my example is a special case or not, so it would be good, as Peter suggests, to hear from anyone else who has used fuzznuc in a similar way.  Though surely if you allow any mismatch at all in your pattern search then you automatically have this scenario of returning multiple results for the same location?


Cheers
Aengus


On 13/10/11 09:44, Peter Rice wrote:
> On 13/10/2011 08:45, Jon Ison wrote:
>> Hi chaps (Aengus !)
>>
>> If I understood Aengus' msg. what's needed is something that simply combines overlapping hits (for
>> a given pattern) into one or more non-overlapping "region of hits", and reports those regions e.g.
>>
>>      Start     End  Strand Pattern_name Mismatch Sequence
>>         54      65       + pattern1            5 GCCAAATAAGGG
>>        104     115       + pattern1            5 CCTAAATAAGGG
>>        179     188       + pattern1            2 CCTTGCTTGG
>>        190     200       + pattern1            6 CCGATTAGAGC
>>
>> Mismatch in this case is reporting the sum of mismatches from before.  A column for number of
>> (sub)matches would also be needed.  Is that right Aengus?
>
> I'm not sure that adding the mismatches is sound. I'd assume just a best
> hit from the overlapping matches.
>
>> The above might give a useful result depending in the input pattern.  It would I think be easy
>> enough to implement.
>
> This is a report output, so post-processing could be done by trimming
> the results before output using an associated qualifier.
>
> Still not sure how useful it would be, we need more feedback from other
> users on this one please!
>
> Peter Rice
> EMBOSS Team
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss


-- 
-----------------------------------------------------------------------
Aengus Stewart                                 Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------



From peter.r.hoyt at okstate.edu  Thu Oct 20 16:29:47 2011
From: peter.r.hoyt at okstate.edu (peter.r.hoyt at okstate.edu)
Date: Thu, 20 Oct 2011 11:29:47 -0500
Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5
Message-ID: <4EA04C7B.70008@okstate.edu>

So I upgraded mEMBOSS (which I've been using for a while), to 6.4.0.4. 
In my previous installs, I had used CygWin, but this time, could NOT get 
CygWin install to work (I really tried!). So I settled for the Windows 
setup file. Now I have jEMBOSS running fine, but it still says version 
1.5. Is this correct? The jEMBOSS version hasn't changed?

My next question coming soon!

Pete


From ajb at ebi.ac.uk  Mon Oct 24 12:56:34 2011
From: ajb at ebi.ac.uk (Alan Bleasby)
Date: Mon, 24 Oct 2011 13:56:34 +0100
Subject: [EMBOSS] Sorry, Windows problem. jEMBOSS upgrade to still says v 1.5
In-Reply-To: <4EA04C7B.70008@okstate.edu>
References: <4EA04C7B.70008@okstate.edu>
Message-ID: <4EA56082.1080307@ebi.ac.uk>

Hello Pete,

This one seems to have remained unanswered.

Yes, the Jemboss version is still 1.5. The GUI has continued to be
updated but the version number has remained the same for quite a while
(an oversight on our part, thanks for highlighting it).

Of course, to show the version of EMBOSS itself, you use the
'embossversion' application, which should show 6.4.0.4, within mEMBOSS,
for the version you've installed.

HTH

Alan


On 10/20/2011 05:29 PM, peter.r.hoyt at okstate.edu wrote:
> So I upgraded mEMBOSS (which I've been using for a while), to 6.4.0.4. 
> In my previous installs, I had used CygWin, but this time, could NOT 
> get CygWin install to work (I really tried!). So I settled for the 
> Windows setup file. Now I have jEMBOSS running fine, but it still says 
> version 1.5. Is this correct? The jEMBOSS version hasn't changed?
>
> My next question coming soon!
>
> Pete
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss


From bernd.web at gmail.com  Fri Oct 28 17:03:15 2011
From: bernd.web at gmail.com (Bernd Web)
Date: Fri, 28 Oct 2011 19:03:15 +0200
Subject: [EMBOSS] fuzznuc pattern expansion
Message-ID: <CAExAtoDc1KrgnRy=_haJqOG-+STgDxiPQFhE5HkucmEjmQRmQA@mail.gmail.com>

Hi

Using fuzznuc I get illegal pattern warnings. I realize what is going on:

"You can use ambiguity codes for nucleic acid searches but not within
[] or {} as they expand to bracketed counterparts. For example, "s" is
expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
illegal."

However, what I cannot find is how to suppress this expansion. Is this
possible? We actually need these ambiguity codes to remain as they are
within [] as the input sequences can themselves contain R, Y, B, N, for
example. Thus, [GCS] is a pattern we actually want to be able to use.


Kind regards,
Bernd


From pmr at ebi.ac.uk  Sat Oct 29 17:06:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 29 Oct 2011 18:06:13 +0100
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To: <CAExAtoDc1KrgnRy=_haJqOG-+STgDxiPQFhE5HkucmEjmQRmQA@mail.gmail.com>
References: <CAExAtoDc1KrgnRy=_haJqOG-+STgDxiPQFhE5HkucmEjmQRmQA@mail.gmail.com>
Message-ID: <4EAC3285.7080501@ebi.ac.uk>

On 28/10/2011 18:03, Bernd Web wrote:
> Hi
>
> Using fuzznuc I get illegal pattern warnings. I realize what is going on:
>
> "You can use ambiguity codes for nucleic acid searches but not within
> [] or {} as they expand to bracketed counterparts. For example, "s" is
> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
> illegal."
>
> However, what I cannot find it how to suppress this expansion. Is this
> possible? We actually need to have these ambiguity remain as they are
> within [] as the input sequences can contain R, Y, B, N themselves for
> example. Thus, [GCS] is a pattern we actually want to be able to use.

That looks a reasonable suggestion.

We can replace S with [GCS] directly. For the wider ambiguity codes, we 
can replace them with the subsets:

B [TGCBSYK]
D [TGADWRK]
H [TCAHWYM]
V [GCAVSRM]

We can also allow 'C\S' to explicitly match CS in the input sequence by 
escaping the S to skip the automatic expansion.
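
The sets listed above follow one rule: a code expands to every IUPAC code
whose bases are all contained in its own bases. The Python sketch below
derives them that way. It is only a check of that reading, not the fuzznuc
code; the names IUPAC, expand and char_class are invented here, and
applying the same rule to codes not listed above (N, for example) is an
extrapolation.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(code):
    """Return every IUPAC code whose base set is contained in that of code."""
    bases = set(IUPAC[code.upper()])
    return {c for c, b in IUPAC.items() if set(b) <= bases}


def char_class(code):
    """Format the expansion as a bracketed character class."""
    return "[" + "".join(sorted(expand(code))) + "]"


if __name__ == "__main__":
    for code in "SBDHV":
        print(code, char_class(code))
```

The letters come out in alphabetical order rather than the order used
above, but the sets are the same: S gives [CGS], and B, D, H and V each
give the seven codes listed.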

These changes can be added to the next release.

Thanks for the idea.

Peter Rice
EMBOSS Team