From aengus.stewart at cancer.org.uk  Wed Dec  5 08:19:05 2007
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 05 Dec 2007 13:19:05 +0000
Subject: [EMBOSS] restrict -limit
Message-ID: <4756A549.1030303@cancer.org.uk>


I seem to be having trouble with restrict not picking up -limit, or am I not using it correctly?

I shouldn't be getting both BssKI and ScrFI, should I, or indeed both BseBI and EcoRII?

########################################
# Program: restrict
# Rundate: Wed  5 Dec 2007 13:17:08
# Commandline: restrict
#    -sitelen 4
#    -enzymes all
#    -limit
#    -blunt
#    -single
#    [-sequence] rs9584819.ff
#    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
# Report_format: table
# Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
########################################

#=======================================
#
# Sequence: rs9584819     from: 1   to: 101
# HitCount: 24
#
# Minimum cuts per enzyme: 1
# Maximum cuts per enzyme: 1
# Minimum length of recognition site: 4
# Blunt ends allowed
# Sticky ends allowed
# DNA is linear
# Ambiguities allowed
#
#=======================================

   Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
      13      17 BssKI       CCNGG                12     17         .         .
      13      17 BseBI       CCWGG                14     15         .         .
      13      17 ScrFI       CCNGG                14     15         .         .
      13      17 EcoRII      CCWGG                12     17         .         .


Regards
Aengus


-- 
-----------------------------------------------------------------------
Aengus Stewart
Head of Bioinformatics and BioStatistics
Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential.  The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.

From aengus.stewart at cancer.org.uk  Wed Dec  5 10:00:50 2007
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 05 Dec 2007 15:00:50 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <4756BD22.6070902@cancer.org.uk>


Yeah I know, not one of my brightest days...............

Helps to look at cut position as well as motif *sigh*


Aengus


Aengus Stewart wrote:
> I seem to be having trouble with restrict not picking up -limit or am I not using it correctly?
> 
> I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI and EcoRII ???
> 
> ########################################
> # Program: restrict
> # Rundate: Wed  5 Dec 2007 13:17:08
> # Commandline: restrict
> #    -sitelen 4
> #    -enzymes all
> #    -limit
> #    -blunt
> #    -single
> #    [-sequence] rs9584819.ff
> #    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> # Report_format: table
> # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> ########################################
> 
> #=======================================
> #
> # Sequence: rs9584819     from: 1   to: 101
> # HitCount: 24
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 1
> # Minimum length of recognition site: 4
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
> 
>    Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
>       13      17 BssKI       CCNGG                12     17         .         .
>       13      17 BseBI       CCWGG                14     15         .         .
>       13      17 ScrFI       CCNGG                14     15         .         .
>       13      17 EcoRII      CCWGG                12     17         .         .
> 
> 
> 
> 
> Regards
> Aengus
> 
> 
> 


-- 
-----------------------------------------------------------------------
Aengus Stewart
Head of Bioinformatics and BioStatistics
Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------


From ajb at ebi.ac.uk  Wed Dec  5 10:08:26 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Dec 2007 15:08:26 -0000 (GMT)
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <56936.81.98.241.17.1196867306.squirrel@webmail.ebi.ac.uk>

Hello Aengus,

Restrict will report enzymes with the same recognition site
if the source REBASE database lists them as having different
cut sites. That appears to be the case with your reported output.
So, you do seem to be using it correctly and the results also seem to
be correct.
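
The rule described above can be sketched in Python using the four rows from the reported output. This is only an illustration of the idea (enzymes are kept apart when either the site or the cut positions differ), not the code restrict itself uses; the function name and the tuple layout are invented for the example.

```python
from collections import defaultdict

# (enzyme, recognition site, 5' cut, 3' cut), copied from the report above.
HITS = [
    ("BssKI", "CCNGG", 12, 17),
    ("BseBI", "CCWGG", 14, 15),
    ("ScrFI", "CCNGG", 14, 15),
    ("EcoRII", "CCWGG", 12, 17),
]


def group_by_site_and_cut(rows):
    """Group enzyme names by (site, 5' cut, 3' cut); hypothetical helper."""
    groups = defaultdict(list)
    for name, site, cut5, cut3 in rows:
        groups[(site, cut5, cut3)].append(name)
    return dict(groups)


GROUPS = group_by_site_and_cut(HITS)
for key, names in sorted(GROUPS.items()):
    print(key, names)
```

Each of the four enzymes ends up in its own group, because BssKI and ScrFI share a site but not cut positions, and likewise BseBI and EcoRII.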

Alan


>
> I seem to be having trouble with restrict not picking up -limit or am I
> not using it correctly?
>
> I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI
> and EcoRII ???
>
> ########################################
> # Program: restrict
> # Rundate: Wed  5 Dec 2007 13:17:08
> # Commandline: restrict
> #    -sitelen 4
> #    -enzymes all
> #    -limit
> #    -blunt
> #    -single
> #    [-sequence] rs9584819.ff
> #    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> # Report_format: table
> # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> ########################################
>
> #=======================================
> #
> # Sequence: rs9584819     from: 1   to: 101
> # HitCount: 24
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 1
> # Minimum length of recognition site: 4
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
>
>    Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
>       13      17 BssKI       CCNGG                12     17         .         .
>       13      17 BseBI       CCWGG                14     15         .         .
>       13      17 ScrFI       CCNGG                14     15         .         .
>       13      17 EcoRII      CCWGG                12     17         .         .
>
>
>
>
> Regards
> Aengus
>
>
>
> --
> -----------------------------------------------------------------------
> Aengus Stewart
> Head of Bioinformatics and BioStatistics
> Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
> Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
> -----------------------------------------------------------------------
>
>


From gbottu at vub.ac.be  Wed Dec  5 10:30:25 2007
From: gbottu at vub.ac.be (Guy Bottu)
Date: Wed, 05 Dec 2007 16:30:25 +0100
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <4756C411.8090101@vub.ac.be>

Aengus Stewart wrote:
> I seem to be having trouble with restrict not picking up -limit or am I not using it correctly?

By default, restrict searches only for prototype enzymes; to see all
enzymes you must explicitly set -nolimit. However, I notice that at our site too
the file .../share/EMBOSS/data/embossre.equ does not contain entries for BssKI
and BseBI, although it should. Maybe there is a bug in the program rebaseextract or
some subtle typo in the files from REBASE. Could the EMBOSS team figure it out?

	Guy Bottu,
	Belgian EMBnet Node

From pmr at ebi.ac.uk  Wed Dec  5 10:57:59 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Dec 2007 15:57:59 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756C411.8090101@vub.ac.be>
References: <4756A549.1030303@cancer.org.uk> <4756C411.8090101@vub.ac.be>
Message-ID: <4756CA87.2080206@ebi.ac.uk>

Guy Bottu wrote:
> I however notice that also at our site 
> the file .../share/EMBOSS/data/embossre.equ does not contain entries for BssKI
> and BseBI, while it should. Maybe there is a bug in the program rebaseextract or 
> some subtle typo in the files from the Rebase. Could the EMBOSS team figure it out ?

Which version of REBASE did you use for rebaseextract?

Peter

From sum732 at mail.usask.ca  Fri Dec  7 18:01:43 2007
From: sum732 at mail.usask.ca (Sudeep Mehrotra)
Date: Fri, 07 Dec 2007 17:01:43 -0600
Subject: [EMBOSS] Emboss-Digest
Message-ID: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>

Hello,
I used "digest" from EMBOSS to digest a protein database obtained from
NCBI RefSeq.
Here is how I executed digest:
digest -seqall "DB_NAME" -aadata "File_name" -outfile "File_Name"
 From the list I selected trypsin.
For some reason, digest skipped this particular protein (no fragments
were generated):
 >gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis  
stolonifera]
MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR

any ideas?

I should get two fragments. I don't want to see the partial digests,
which is why I did not select that option.

Thanks
Sudeep

From ajb at ebi.ac.uk  Fri Dec  7 20:13:31 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Sat, 8 Dec 2007 01:13:31 -0000 (GMT)
Subject: [EMBOSS] Emboss-Digest
In-Reply-To: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
References: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
Message-ID: <34776.81.98.241.17.1197076411.squirrel@webmail.ebi.ac.uk>

Dear Sudeep,

Trypsin doesn't cut as well if (e.g.) the K is followed by any of
"KRIFLP" (Prof. D. Pappin, personal comm). Your sequence contains
"...KL..." so there is no cut. If you want unfavoured cuts to be
shown (e.g. a cut after every K for trypsin) then add the flag
"-unfavoured" to the command line.

HTH

Alan
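
A minimal Python sketch of the rule described above, for readers who want to check a sequence by hand. It is not the digest source code. It assumes that the "KRIFLP" exception applies after R as well as after K (the message only gives K as an example) and that an uncut protein is returned as one fragment; the function and constant names are invented for the illustration.

```python
# Residues that, when they follow K or R, make the trypsin cut unfavoured
# (the "KRIFLP" set quoted in the message above).
UNFAVOURED_NEXT = set("KRIFLP")


def trypsin_digest(seq, unfavoured=False):
    """Split seq after each K or R; skip unfavoured sites unless asked."""
    fragments = []
    start = 0
    for i, residue in enumerate(seq[:-1]):  # never cut after the last residue
        if residue not in "KR":
            continue
        if not unfavoured and seq[i + 1] in UNFAVOURED_NEXT:
            continue
        fragments.append(seq[start:i + 1])
        start = i + 1
    fragments.append(seq[start:])
    return fragments


PSBK = "MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR"

print(trypsin_digest(PSBK))                   # favoured cuts only
print(trypsin_digest(PSBK, unfavoured=True))  # also cut at the "KL" site
```

With favoured cuts only the protein stays in one piece, because its single internal K is followed by L; allowing unfavoured cuts gives the two fragments expected in the original question.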


> Hello,
> I used "digest" from EMBOSS to digest protein database obtained from
> NCBI REFSEQ.
> Here is how I executed digest:
> digest -seqall "DB_NAME" -aadata "File_name"- outfile "File_Name"
>  From the list I selected trypsin
> For some reason, digest skipped (no fragments were generated) for this
> particular protein
>  >gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis
> stolonifera]
> MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR
>
> any ideas?
>
> I should get two fragments. I don't want to see the partial digests so
> that is why I never selected the option.
>
> Thanks
> Sudeep
>


From mike.thon at gmail.com  Wed Dec 12 05:24:34 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 12 Dec 2007 11:24:34 +0100
Subject: [EMBOSS] EMBOSS database queries
Message-ID: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>

I am setting up a database from GenBank-formatted files.  I understand
how to index the db and configure the emboss.default file, but I don't
know how to construct the queries.  Queries for sequence IDs are
pretty simple, i.e. with a USA of the format "dbname:id".  But how do
I create a query for the other fields, such as org and key?  Also, do
these fields support wildcards or substring matches or other fancy
stuff?
cheers
Mike

From pmr at ebi.ac.uk  Wed Dec 12 06:21:51 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 12 Dec 2007 11:21:51 +0000
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
Message-ID: <475FC44F.7030700@ebi.ac.uk>

Michael Thon wrote:
> I am setting up a database from Genbank formatted files.  I understand 
> how to index the db and configure the emboss.default file but I don't 
> know how to construct the queries.  queries for sequence IDs are pretty 
> simple, i.e. with a USA of the format "dbname:id".  But, how to I create 
> a query for the other fields, such as org and key?  Also, do these 
> fields support wildcards or substring matches or other fancy stuff?

Assuming you indexed all the fields (by default ID and ACC are indexed),
you use the same syntax as in SRS. We saw no need to invent a new
syntax, so we used the same field name abbreviations, but we did drop the
'[]' around the query :-)

dbname-acc:x13776
dbname-org:pseudomonas*
dbname-des:amidase
dbname-key:
dbname-sv:
dbname-gi:

and, to complete the set, dbname-id:x13776

As you see, wildcards are allowed with '*' at the end.
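
For scripts that build many such queries, the pattern above (database name, hyphen, field, colon, term, optional trailing '*') can be wrapped in a small helper. This is a sketch only: the function name is invented, and the field list is simply the set of abbreviations shown above, which will only work if those fields were actually indexed.

```python
# Field abbreviations taken from the examples above.
KNOWN_FIELDS = {"id", "acc", "org", "des", "key", "sv", "gi"}


def make_query(dbname, field, term, prefix_match=False):
    """Build a 'dbname-field:term' query string; hypothetical helper."""
    if field not in KNOWN_FIELDS:
        raise ValueError(f"unknown field: {field}")
    if "*" in term:
        # Only a trailing wildcard is described above, so keep it explicit.
        raise ValueError("leave '*' out of term; use prefix_match=True")
    suffix = "*" if prefix_match else ""
    return f"{dbname}-{field}:{term}{suffix}"


print(make_query("dbname", "acc", "x13776"))
print(make_query("dbname", "org", "pseudomonas", prefix_match=True))
```

The resulting strings can then be passed to any EMBOSS program wherever a sequence USA is expected.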

We can make this much more sophisticated, allowing more wildcard options
and combining queries. So far EMBOSS users have been content to use SRS
or alternatives (MRS for example).

If there is interest, we can extend the USA to include wildcards,
AND/OR/NOT, search multiple fields, combine databases, and if we get
really ambitious we could include links between databases.

We will have to be careful to restrict some of these extensions to
database access methods that support them.

Hope this helps,

Peter

From mike.thon at gmail.com  Wed Dec 12 11:12:05 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 12 Dec 2007 17:12:05 +0100
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <475FC44F.7030700@ebi.ac.uk>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
	<475FC44F.7030700@ebi.ac.uk>
Message-ID: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>

Thanks Peter, I got it working.
While I'm at it, a couple more questions popped up:
1) Do you know if these indexes are compatible with the Bio::DB::Registry
type databases?
2) Is there any way to index and search sequence features?
Best
Mike


On Dec 12, 2007, at 12:21 PM, Peter Rice wrote:

> Michael Thon wrote:
>> I am setting up a database from Genbank formatted files.  I  
>> understand how to index the db and configure the emboss.default  
>> file but I don't know how to construct the queries.  queries for  
>> sequence IDs are pretty simple, i.e. with a USA of the format  
>> "dbname:id".  But, how to I create a query for the other fields,  
>> such as org and key?  Also, do these fields support wildcards or  
>> substring matches or other fancy stuff?
>
> Assuming you indexed all the fields (by default ID and ACC are  
> indexed)
> you use the same syntax as in srs (we saw no need to invent a new
> syntax, so we used the same field name abbreviations but we did drop  
> the
> '[]' around the query :-)
>
> dbname-acc:x13776
> dbname-org:pseudomonas*
> dbname-des:amidase
> dbname-key:
> dbname-sv:
> dbname-gi:
>
> and, to complete the set, dbname-id:x13776
>
> As you see, wildcards are allowed with '*' at the end.
>
> We can make this much more sophisticated, allowing more wildcard  
> options
> and combining queries. So far EMBOSS users have been content to use  
> SRS
> or alternatives (MRS for example).
>
> If there is interest, we can extend the USA to include wildcards,
> AND/OR/NOT, search multiple fields, combine databases, and if we get
> really ambitious we could include links between databases.
>
> We will have to be careful to restrict some of these extensions to
> database access methods that support them.
>
> Hope this helps,
>
> Peter


From pmr at ebi.ac.uk  Wed Dec 12 11:20:31 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 12 Dec 2007 16:20:31 +0000
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
	<475FC44F.7030700@ebi.ac.uk>
	<4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
Message-ID: <47600A4F.8030107@ebi.ac.uk>

Michael Thon wrote:
> Thanks Peter, I got it working.
> While I'm at it, a couple more questions popped up:
> 1) do you know if  these indexes compatible with the Bio::DB::Registry 
> type databases?

No ... well, we could add Bio::DB indices to the things EMBOSS can 
retrieve, then they would be :-)

> 2) Is there any way to index and search sequence features?

Not at present - but:

2a. what would you like to search for ...
2b. what would you like as the result ...
   2b.i. if you want features, what do we call them?

regards,

Peter

From mike.thon at gmail.com  Fri Dec 14 12:28:59 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Fri, 14 Dec 2007 18:28:59 +0100
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <47600A4F.8030107@ebi.ac.uk>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
	<475FC44F.7030700@ebi.ac.uk>
	<4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
	<47600A4F.8030107@ebi.ac.uk>
Message-ID: <2F72B57C-A8C4-4A8E-84B6-5793764DBDD4@gmail.com>


On Dec 12, 2007, at 5:20 PM, Peter Rice wrote:

> Michael Thon wrote:
>> Thanks Peter, I got it working.
>> While I'm at it, a couple more questions popped up:
>> 1) do you know if  these indexes compatible with the  
>> Bio::DB::Registry type databases?
>
> No ... well, we could add Bio::DB indices to the things EMBOSS can  
> retrieve, then they would be :-)
>
>> 2) Is there any way to index and search sequence features?
>
> Not at present - but:
>
> 2a. what would you like to search for ...
> 2b. what would you like as the result ...
>  2b.i. if you want features, what do we call them?
>
Actually, I haven't given it much thought.  But, for starters, one
might want to retrieve proteins containing domain X, or that are
annotated with InterPro term Y.  Perhaps some of this functionality
could be accomplished through clever use of the key or des fields, i.e.
by putting all the InterPro terms assigned to a protein in the keyword
field prior to indexing.

One might also want to query a database of genomic DNA and fetch a
translation of a gene or its spliced CDS.
best
Mike


From bernd.web at gmail.com  Mon Dec 17 15:01:32 2007
From: bernd.web at gmail.com (Bernd Web)
Date: Mon, 17 Dec 2007 21:01:32 +0100
Subject: [EMBOSS] iep/gifasta
Message-ID: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>

Hi,

I'd like to run iep on a sequence and use either pir or osformat gifasta.
The following gives an error (using emboss 5.0.0 on Debian):

iep -filter -osformat gifasta -sequence seq.txt
This returns "Died: Unknown qualifier -osformat"

iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
also give an error:
"Died: iep terminated: Bad value for '-sequence' with -auto defined"
(with or without the sequence flag)

However, iep -sformat fasta seq.txt works. What am I doing wrong?

I'd like output to contain the accession number. I thought -osformat
gifasta was for this purpose.
My FastA definition line is e.g.
>ENSG00000205090|1|protein_coding.
The IEP report would be more useful if it contained the ENSG number
instead of "protein_coding" or the entire definition line.

How to do this?


Kind regards,
Bernd

From pmr at ebi.ac.uk  Tue Dec 18 04:23:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 18 Dec 2007 09:23:18 +0000
Subject: [EMBOSS] iep/gifasta
In-Reply-To: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
References: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
Message-ID: <47679186.6020003@ebi.ac.uk>

Hi Bernd,

Bernd Web wrote:
> Hi,
> 
> I'd like to run iep on a sequence and use either pir or osformat gifasta.
> The following gives an error (using emboss 5.0.0 on Debian):
> 
> iep -filter -osformat gifasta -sequence seq.txt
> This returns "Died: Unknown qualifier -osformat"

-osformat is for sequence outputs (and iep has no sequence outputs)

iep writes a plain text file as output and has no special options for it,
but we will add more information (accession and description) in a
future release, and to other plain text output files too.

> iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
> also give an error:
> "Died: iep terminated: Bad value for '-sequence' with -auto defined"
> (with or without the sequence flag)
> 
> However, iep -sformat fasta seq.txt works. What am I doing wrong?

It appears your sequence can be read in fasta format but not in pir 
format. PIR format has special characters after the first '>'

> My FastA definition line is e.g.
>> ENSG00000205090|1|protein_coding.
> The IEP report would me more useful if it contains the ENSG number
> instead of "protein coding or the entire definition line.

Not a nice format. NCBI made up a lot of FASTA file identifiers with '|' 
characters and we try to follow their rules. That causes us to ignore 
the first part (it should be a database name) and read the ID from the end.

You could reformat the FASTA files (e.g. with a perl script) to remove 
the '|' characters and leave something useful as the plain ID (perhaps 
ENSG00000205090_1 in this case) and the rest as description.
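
The suggestion above mentions perl; the same reformatting can be sketched in Python. This assumes every header looks like the example (at least three '|'-separated parts, the first two forming the new ID); the function name is invented, and headers in any other shape are passed through unchanged.

```python
def reformat_header(line):
    """Turn '>A|B|rest' into '>A_B rest'; leave all other lines untouched."""
    if not line.startswith(">"):
        return line  # sequence line
    parts = line[1:].rstrip("\n").split("|")
    if len(parts) < 3:
        return line  # not the layout assumed here
    new_id = "_".join(parts[:2])
    description = " ".join(parts[2:])
    return f">{new_id} {description}\n"


print(reformat_header(">ENSG00000205090|1|protein_coding.\n"), end="")
```

Applying the function to each line of seq.txt and writing the result to a new file gives a FASTA file whose ID is ENSG00000205090_1 and whose description is the remainder of the original line.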

Hope that helps,

Peter Rice


From peter.robinson at t-online.de  Thu Dec 20 10:08:59 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 16:08:59 +0100
Subject: [EMBOSS] Seqall Datatype
Message-ID: <476A858B.9080403@t-online.de>

Dear EMBOSSERs,

I am trying my hand at an EMBOSS program and would like to read in a 
list of sequences from a FASTA file and make pairwise comparisons 
between each sequence.  If I start with an AjPSeqall object

AjPSeqall seqs=NULL;
seqs = ajAcdGetSeqall ("seqs");

I have seen

AjPSeq seq = NULL;   /* initialise the pointer before the first call */

while(ajSeqallNext(seqs, &seq)) {
    /* work with each sequence here */
}

in the documentation, but I would like to do something like a double for 
loop to get all pairwise comparisons. What is the best way of doing 
this? I have been searching in the online docs but did not yet find 
anything.

By the way, in http://emboss.sourceforge.net/developers/program.html

*17.2 Getting information from a sequence*

*ajSeqGetName* get the name. This is a pointer to the internal AjPStr

*ajSeqName* get the name. This is a pointer to the internal char*

these datatypes are flagged as obsolete by the compiler, so the document 
may need revision here?

Thanks,
Peter Robinson


From pmr at ebi.ac.uk  Thu Dec 20 11:07:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 20 Dec 2007 16:07:18 +0000
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A858B.9080403@t-online.de>
References: <476A858B.9080403@t-online.de>
Message-ID: <476A9336.9090504@ebi.ac.uk>

Dear Peter,

> I am trying my hand at an EMBOSS program and would like to read in a 
> list of sequences from a FASTA file and make pairwise comparisons 
> between each sequence.  If I startwith a AjPSeqall object
> 
> AjPSeqall seqs=NULL;
> seqs = ajAcdGetSeqall ("seqs");

You want all the sequences in memory so you can work through the pairs - 
better to use AjPSeqset and ajAcdGetSeqset

> in the documentation, but I would like to do something like a double for 
> loop to get all pairwise comparisons. What is the best way of doing 
> this? I have been searching in the online docs but did not yet find 
> anything.

distmat has the kind of loop you are looking for (except it does self 
matches too)
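
The indexing pattern for such a loop, with self matches left out, can be sketched as follows. Python is used here only to show the i/j bounds on an in-memory list; it is not the distmat code, and the mismatch counter is a placeholder comparison invented for the example.

```python
def all_pairs(seqs, compare):
    """Apply compare() to every unordered pair of distinct sequences."""
    results = {}
    for i in range(len(seqs)):
        # Starting j at i + 1 skips self matches and mirrored pairs.
        for j in range(i + 1, len(seqs)):
            results[(i, j)] = compare(seqs[i], seqs[j])
    return results


def mismatches(a, b):
    """Placeholder comparison: count differing positions."""
    return sum(1 for x, y in zip(a, b) if x != y)


print(all_pairs(["ACGT", "ACGA", "TCGA"], mismatches))
```

Starting the inner loop at j = i instead would include the self matches that distmat reports.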


> By the way, in http://emboss.sourceforge.net/developers/program.html
> 
> these datatypes are flagged as obsolete by the compiler, so the document 
> may need revision here?

Yes, all being revised for the books we are preparing ... we will take a 
look through program.html and make some basic updates to correct these 
things.

regards,

Peter

From peter.robinson at t-online.de  Thu Dec 20 12:00:38 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 18:00:38 +0100
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A9336.9090504@ebi.ac.uk>
References: <476A858B.9080403@t-online.de> <476A9336.9090504@ebi.ac.uk>
Message-ID: <476A9FB6.6080407@t-online.de>

Peter Rice wrote:
> Dear Peter,
>
>> I am trying my hand at an EMBOSS program and would like to read in a 
>> list of sequences from a FASTA file and make pairwise comparisons 
>> between each sequence.  If I startwith a AjPSeqall object
>>
>> AjPSeqall seqs=NULL;
>> seqs = ajAcdGetSeqall ("seqs");
>
> You want all the sequences in memory so you can work through the pairs 
> - better to use AjPSeqset and ajAcdGetSeqset
>
>> in the documentation, but I would like to do something like a double 
>> for loop to get all pairwise comparisons. What is the best way of 
>> doing this? I have been searching in the online docs but did not yet 
>> find anything.
>
> distmat has the kind of loop you are looking for (except it does self 
> matches too)
>
>
>> By the way, in http://emboss.sourceforge.net/developers/program.html
>>
>> these datatypes are flagged as obsolete by the compiler, so the 
>> document may need revision here?
>
> Yes, all being revised for the books we are preparing ... we will take 
> a look through program.html and make some basic updates to correct 
> these things.
>
> regards,
>
> Peter
>
Dear Peter,
thanks for the tip, that was just what I needed!
best wishes for the holidays!
Peter

From staffa at niehs.nih.gov  Thu Dec 20 16:44:02 2007
From: staffa at niehs.nih.gov (Staffa, Nick (NIH/NIEHS))
Date: Thu, 20 Dec 2007 16:44:02 -0500
Subject: [EMBOSS] newcpgreport
Message-ID: <C3904C52.77F8%staffa@niehs.nih.gov>

I have been using EMBOSS newcpgreport by
Rodrigo Lopez (rls at ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK

http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
says:
By default, this program defines a CpG island as a region where, over an
average of 10 windows, the calculated % composition is over 50% and the
calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
of 200 bases. These conditions can be modified by setting the values of the
appropriate parameters.

I may be very dull and unimaginative, but I'd sure like a more detailed
explanation of what the program is doing to define a CpG island.
Does anyone know where this might be found?
Or even the code.

Can anyone help please.

Thanks
 
Nick Staffa 
Telephone: 919-316-4569  (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov)
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina


From pmr at ebi.ac.uk  Fri Dec 21 04:09:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 21 Dec 2007 09:09:18 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To: <C3904C52.77F8%staffa@niehs.nih.gov>
References: <C3904C52.77F8%staffa@niehs.nih.gov>
Message-ID: <476B82BE.7000602@ebi.ac.uk>

Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls ? ebi.ac.uk)
> 
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.

The code is included in EMBOSS (as emboss/newcpgreport.c)

The original reference for the CpG island criteria is in the paper 
listed in the "references" section of the newcpgreport documentation.

Larsen F., Gundersen G., Lopez R., Prydz H. "CpG islands as gene markers 
in the human genome" Genomics 13:1095-1107 (1992)

If memory serves, this refers to earlier work by Gardiner-Garden.

If you need more information I am just along the corridor from Rodrigo's 
office ... once we're both back after Xmas :-)

Hope that helps,

Peter

From rls at ebi.ac.uk  Fri Dec 21 04:11:05 2007
From: rls at ebi.ac.uk (Rodrigo Lopez)
Date: Fri, 21 Dec 2007 09:11:05 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To: <C3904C52.77F8%staffa@niehs.nih.gov>
References: <C3904C52.77F8%staffa@niehs.nih.gov>
Message-ID: <476B8329.9060007@ebi.ac.uk>

Hi,

The relevant papers describing the method in detail are:

PubMed:3656447
Gardiner-Garden M., Frommer M.
CpG islands in vertebrate genomes.
(20-Jul-1987) Journal of molecular biology, 196 (2) :261-82

PubMed:1505946
Larsen F., Gundersen G., Lopez R., Prydz H.
CpG islands as gene markers in the human genome.
(Aug-1992) Genomics, 13 (4) :1095-107

The source code - currently maintained by the EMBOSS team - is in the 
EMBOSS distribution.  See your <yourdir>/EMBOSS-5.0.0/emboss/newcpgreport.c
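
As a rough illustration of the two quantities named in the documentation text quoted below, here is a Python sketch for a single window. It assumes the Gardiner-Garden and Frommer definition Obs/Exp = (CpG count x window length) / (C count x G count) and takes "% composition" to mean %C + %G. It is not the newcpgreport code: the averaging over 10 windows and the 200-base minimum length are left out, and the function names are invented.

```python
def cpg_stats(window):
    """Return (%C+G, Obs/Exp CpG) for one window of sequence."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides
    percent_cg = 100.0 * (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return percent_cg, obs_exp


def passes_defaults(window, min_percent=50.0, min_obs_exp=0.6):
    """Check one window against the default thresholds quoted below."""
    percent_cg, obs_exp = cpg_stats(window)
    return percent_cg > min_percent and obs_exp > min_obs_exp


print(cpg_stats("CGCGCGCG"))
print(passes_defaults("ATATATAT"))
```

The two thresholds correspond to the 50% and 0.6 defaults in the documentation; the source file named above remains the place to check how the program combines them across windows.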

Hope this helps. Please do not hesitate to contact me if you have 
further queries.

R:)


Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls ? ebi.ac.uk)
> European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
> Cambridge CB10 1SD, UK
> 
> http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
> says:
> By default, this program defines a CpG island as a region where, over an
> average of 10 windows, the calculated % composition is over 50% and the
> calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
> of 200 bases. These conditions can be modified by setting the values of the
> appropriate parameters.
> 
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.
> 
> Can anyone help please.
> 
> Thanks
>  
> Nick Staffa 
> Telephone: 919-316-4569  (NIEHS: 6-4569)
> Scientific Computing Support Group
> NIEHS Information Technology Support Services Contract
> (Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov)
> National Institute of Environmental Health Sciences
> National Institutes of Health
> Research Triangle Park, North Carolina
> 
> 

From gbottu at vub.ac.be  Wed Dec 26 09:46:40 2007
From: gbottu at vub.ac.be (Guy Bottu)
Date: Wed, 26 Dec 2007 15:46:40 +0100
Subject: [EMBOSS] extractalign
Message-ID: <47726950.6070007@vub.ac.be>

	Dear all,

I just noticed that EMBOSS version 5 contains a program extractalign, which
extracts ranges from a multiple sequence alignment. This is certainly an
interesting tool. However, the program is not accompanied by an on-line manual
and it is not mentioned in the ChangeLog. Any comment from the developers?

	Happy Christmas to you all,
	Guy Bottu,
	BEN


From david at compbio.dundee.ac.uk  Thu Dec 27 06:31:41 2007
From: david at compbio.dundee.ac.uk (David Martin)
Date: Thu, 27 Dec 2007 11:31:41 +0000
Subject: [EMBOSS] Identifying sequence formats.
Message-ID: <C3993D9D.2D2E0%david@compbio.dundee.ac.uk>

Is there an easy way of identifying the format of a sequence using EMBOSS?
It does wonderful autodetect but I'd like to be able to find out what it
thinks the sequence format is for an arbitrary sequence.

regards

..d


-- 
David Martin PhD
Post-Genomics and Molecular Interactions Centre
University of Dundee
http://www.compbio.dundee.ac.uk/
 

From pmr at ebi.ac.uk  Fri Dec 28 05:20:27 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 28 Dec 2007 10:20:27 +0000
Subject: [EMBOSS] Identifying sequence formats.
In-Reply-To: <C3993D9D.2D2E0%david@compbio.dundee.ac.uk>
References: <C3993D9D.2D2E0%david@compbio.dundee.ac.uk>
Message-ID: <4774CDEB.4040407@ebi.ac.uk>

David Martin wrote:
> Is there an easy way of identifying the format of a sequence using EMBOSS?
> It does wonderful autodetect but I'd like to be able to find out what it
> thinks the sequence format is for an arbitrary sequence.

The information is stored so you can craft a little application to print 
out the value of the FormatStr attribute.

There may be some oddities .... it automatically switches between 
EMBL/SwissProt and FASTA/NCBI formats depending on the first line. Let 
us know and we can look to apply corrections.

Season's greetings and all the best for the New Year

Peter
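
As a rough illustration of what "depending on the first line" can mean, here
is a small Python sketch that guesses a format from the first line of a file.
It is not the EMBOSS detection code and it does not read the FormatStr
attribute; the function name guess_format and the three prefixes it checks
are assumptions chosen for the example, and EMBOSS itself tries many more
formats than this.

```python
def guess_format(first_line):
    """Very rough format guess from the first line of a sequence file."""
    if first_line.startswith(">"):
        return "fasta"            # FASTA/NCBI style header
    if first_line.startswith("ID   "):
        return "embl/swissprot"   # EMBL and SwissProt both start with an ID line
    if first_line.startswith("LOCUS"):
        return "genbank"
    return "unknown"


if __name__ == "__main__":
    for line in (">seq1 test", "ID   X13776; SV 1;", "LOCUS       X13776"):
        print(guess_format(line))
```

A first-line test like this cannot tell EMBL from SwissProt, or plain FASTA
from NCBI-style FASTA, which is where the oddities mentioned above would
show up.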


From pmr at ebi.ac.uk  Fri Dec 28 05:37:07 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 28 Dec 2007 10:37:07 +0000
Subject: [EMBOSS] extractalign
In-Reply-To: <47726950.6070007@vub.ac.be>
References: <47726950.6070007@vub.ac.be>
Message-ID: <4774D1D3.4050503@ebi.ac.uk>

Guy Bottu wrote:
>     Dear all,
> 
> I just noticed that EMBOSS version 5 contains a program extractalign, which
> extracts ranges from a multiple sequence alignment. This is certainly an
> interesting tool. The program is however not accompanied by an on-line 
> manual
> and it is not mentioned in the Changelog. Any comment fom the developers ?

Well ... it is accompanied by an online manual .... just not included in 
the programs index.

edialign and wordfinder were also missing.

Now to update the ChangeLog (wordfinder is missing there too)...

Season's greetings and Happy New Year

Peter

From aengus.stewart at cancer.org.uk  Wed Dec  5 13:19:05 2007
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 05 Dec 2007 13:19:05 +0000
Subject: [EMBOSS] restrict -limit
Message-ID: <4756A549.1030303@cancer.org.uk>


I seem to be having trouble with restrict not picking up -limit, or am I not using it correctly?

I shouldn't be getting both BssKI and ScrFI, should I, or indeed both BseBI and EcoRII?

########################################
# Program: restrict
# Rundate: Wed  5 Dec 2007 13:17:08
# Commandline: restrict
#    -sitelen 4
#    -enzymes all
#    -limit
#    -blunt
#    -single
#    [-sequence] rs9584819.ff
#    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
# Report_format: table
# Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
########################################

#=======================================
#
# Sequence: rs9584819     from: 1   to: 101
# HitCount: 24
#
# Minimum cuts per enzyme: 1
# Maximum cuts per enzyme: 1
# Minimum length of recognition site: 4
# Blunt ends allowed
# Sticky ends allowed
# DNA is linear
# Ambiguities allowed
#
#=======================================

   Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
      13      17 BssKI       CCNGG                12     17         .         .
      13      17 BseBI       CCWGG                14     15         .         .
      13      17 ScrFI       CCNGG                14     15         .         .
      13      17 EcoRII      CCWGG                12     17         .         .


Regards
Aengus


-- 
-----------------------------------------------------------------------
Aengus Stewart
Head of Bioinformatics and BioStatistics
Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential.  The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.


From aengus.stewart at cancer.org.uk  Wed Dec  5 15:00:50 2007
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 05 Dec 2007 15:00:50 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <4756BD22.6070902@cancer.org.uk>


Yeah I know, not one of my brightest days...............

Helps to look at cut position as well as motif *sigh*


Aengus


Aengus Stewart wrote:
> I seem to be having trouble with restrict not picking up -limit or am I not using it correctly?
> 
> I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI and EcoRII ???
> 
> ########################################
> # Program: restrict
> # Rundate: Wed  5 Dec 2007 13:17:08
> # Commandline: restrict
> #    -sitelen 4
> #    -enzymes all
> #    -limit
> #    -blunt
> #    -single
> #    [-sequence] rs9584819.ff
> #    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> # Report_format: table
> # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> ########################################
> 
> #=======================================
> #
> # Sequence: rs9584819     from: 1   to: 101
> # HitCount: 24
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 1
> # Minimum length of recognition site: 4
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
> 
>    Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
>       13      17 BssKI       CCNGG                12     17         .         .
>       13      17 BseBI       CCWGG                14     15         .         .
>       13      17 ScrFI       CCNGG                14     15         .         .
>       13      17 EcoRII      CCWGG                12     17         .         .
> 
> 
> 
> 
> Regards
> Aengus
> 
> 
> 


-- 
-----------------------------------------------------------------------
Aengus Stewart
Head of Bioinformatics and BioStatistics
Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------



From ajb at ebi.ac.uk  Wed Dec  5 15:08:26 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Dec 2007 15:08:26 -0000 (GMT)
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <56936.81.98.241.17.1196867306.squirrel@webmail.ebi.ac.uk>

Hello Aengus,

Restrict will report enzymes with the same recognition site
if the source REBASE database lists them as having different
cut sites. That appears to be the case with your reported output.
So, you do seem to be using it correctly and the results also seem to
be correct.

Alan
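
To make the point concrete, here is a small sketch (Python, not EMBOSS code)
that takes the four rows from the report and groups them by recognition site
plus cut positions. The HITS list and the helper name group_by_site_and_cut
are made up for the illustration; how restrict and embossre.equ decide on
equivalence internally is not shown here.

```python
from collections import defaultdict

# Rows copied from the restrict report:
# (enzyme, recognition site, 5' cut, 3' cut)
HITS = [
    ("BssKI", "CCNGG", 12, 17),
    ("BseBI", "CCWGG", 14, 15),
    ("ScrFI", "CCNGG", 14, 15),
    ("EcoRII", "CCWGG", 12, 17),
]


def group_by_site_and_cut(hits):
    """Group enzyme names by (site, 5' cut, 3' cut)."""
    groups = defaultdict(list)
    for name, site, cut5, cut3 in hits:
        groups[(site, cut5, cut3)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for key, names in group_by_site_and_cut(HITS).items():
        print(key, names)
```

Each of the four enzymes ends up in a group of its own: BssKI and ScrFI share
CCNGG but cut at different positions, and the same holds for BseBI and EcoRII
on CCWGG.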


>
> I seem to be having trouble with restrict not picking up -limit or am I
> not using it correctly?
>
> I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI
> and EcoRII ???
>
> ########################################
> # Program: restrict
> # Rundate: Wed  5 Dec 2007 13:17:08
> # Commandline: restrict
> #    -sitelen 4
> #    -enzymes all
> #    -limit
> #    -blunt
> #    -single
> #    [-sequence] rs9584819.ff
> #    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> # Report_format: table
> # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> ########################################
>
> #=======================================
> #
> # Sequence: rs9584819     from: 1   to: 101
> # HitCount: 24
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 1
> # Minimum length of recognition site: 4
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
>
>    Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
>       13      17 BssKI       CCNGG                12     17         .         .
>       13      17 BseBI       CCWGG                14     15         .         .
>       13      17 ScrFI       CCNGG                14     15         .         .
>       13      17 EcoRII      CCWGG                12     17         .         .
>
>
>
>
> Regards
> Aengus
>
>
>
> --
> -----------------------------------------------------------------------
> Aengus Stewart
> Head of Bioinformatics and BioStatistics
> Bioinformatics and BioStatistics               Tel: +44 (0)20 7269 3679
> Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
> -----------------------------------------------------------------------


From gbottu at vub.ac.be  Wed Dec  5 15:30:25 2007
From: gbottu at vub.ac.be (Guy Bottu)
Date: Wed, 05 Dec 2007 16:30:25 +0100
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <4756C411.8090101@vub.ac.be>

Aengus Stewart wrote:
> I seem to be having trouble with restrict not picking up -limit or am I not using it correctly?

restrict by default searches only for prototype enzymes; if you want to see all 
enzymes you must explicitly set -nolimit. I notice, however, that at our site too 
the file .../share/EMBOSS/data/embossre.equ does not contain entries for BssKI
and BseBI, although it should. Maybe there is a bug in the program rebaseextract or 
some subtle typo in the files from REBASE. Could the EMBOSS team figure it out?

	Guy Bottu,
	Belgian EMBnet Node


From pmr at ebi.ac.uk  Wed Dec  5 15:57:59 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Dec 2007 15:57:59 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756C411.8090101@vub.ac.be>
References: <4756A549.1030303@cancer.org.uk> <4756C411.8090101@vub.ac.be>
Message-ID: <4756CA87.2080206@ebi.ac.uk>

Guy Bottu wrote:
> I however notice that also at our site 
> the file .../share/EMBOSS/data/embossre.equ does not contain entries for BssKI
> and BseBI, while it should. Maybe there is a bug in the program rebaseextract or 
> some subtle typo in the files from the Rebase. Could the EMBOSS team figure it out ?

Which version of REBASE did you use for rebaseextract?

Peter


From sum732 at mail.usask.ca  Fri Dec  7 23:01:43 2007
From: sum732 at mail.usask.ca (Sudeep Mehrotra)
Date: Fri, 07 Dec 2007 17:01:43 -0600
Subject: [EMBOSS] Emboss-Digest
Message-ID: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>

Hello,
I used "digest" from EMBOSS to digest protein database obtained from  
NCBI REFSEQ.
Here is how I executed digest:
digest -seqall "DB_NAME" -aadata "File_name" -outfile "File_Name"
 From the list I selected trypsin
For some reason, digest skipped (no fragments were generated) for this  
particular protein
 >gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis stolonifera]
MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR

any ideas?

I should get two fragments. I don't want to see the partial digests so  
that is why I never selected the option.

Thanks
Sudeep


From ajb at ebi.ac.uk  Sat Dec  8 01:13:31 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Sat, 8 Dec 2007 01:13:31 -0000 (GMT)
Subject: [EMBOSS] Emboss-Digest
In-Reply-To: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
References: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
Message-ID: <34776.81.98.241.17.1197076411.squirrel@webmail.ebi.ac.uk>

Dear Sudeep,

Trypsin doesn't cut as well if (e.g.) the K is followed by any of
"KRIFLP" (Prof. D. Pappin, personal comm). Your sequence contains
"...KL..." so there is no cut. If you want unfavoured cuts to be
shown (e.g. a cut after every K for trypsin) then add the flag
"-unfavoured" to the command line.

HTH

Alan
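
A minimal Python sketch of the rule described above, for anyone who wants to
check a sequence by hand. It is not the digest source code: the function
names are invented, and it assumes that the same "KRIFLP" rule applies after
R as after K and that a cut after the last residue is ignored because it
produces no new fragment.

```python
UNFAVOURED_NEXT = set("KRIFLP")  # residues quoted in the message above


def tryptic_cut_sites(seq, unfavoured=False):
    """Return 1-based positions of residues after which a cut is made.

    Assumption: K and R are treated alike, and the final residue is
    never a cut site.
    """
    sites = []
    for i, residue in enumerate(seq[:-1]):
        if residue not in "KR":
            continue
        if unfavoured or seq[i + 1] not in UNFAVOURED_NEXT:
            sites.append(i + 1)
    return sites


def fragments(seq, sites):
    """Split seq at the given 1-based cut positions."""
    bounds = [0] + sites + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


SEQ = "MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR"

if __name__ == "__main__":
    print(tryptic_cut_sites(SEQ))  # favoured cuts only
    print(fragments(SEQ, tryptic_cut_sites(SEQ, unfavoured=True)))
```

With the default settings the only K is followed by L, so no cut site is
found; allowing unfavoured cuts splits the protein after that K into the two
fragments Sudeep expected.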


> Hello,
> I used "digest" from EMBOSS to digest protein database obtained from
> NCBI REFSEQ.
> Here is how I executed digest:
> digest -seqall "DB_NAME" -aadata "File_name"- outfile "File_Name"
>  From the list I selected trypsin
> For some reason, digest skipped (no fragments were generated) for this
> particular protein
>  >gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis
> stolonifera]
> MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR
>
> any ideas?
>
> I should get two fragments. I don't want to see the partial digests so
> that is why I never selected the option.
>
> Thanks
> Sudeep


From mike.thon at gmail.com  Wed Dec 12 10:24:34 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 12 Dec 2007 11:24:34 +0100
Subject: [EMBOSS] EMBOSS database queries
Message-ID: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>

I am setting up a database from GenBank-formatted files.  I understand
how to index the db and configure the emboss.default file, but I don't
know how to construct the queries.  Queries for sequence IDs are
pretty simple, i.e. with a USA of the format "dbname:id".  But how do
I create a query for the other fields, such as org and key?  Also, do
these fields support wildcards or substring matches or other fancy
stuff?
cheers
Mike


From pmr at ebi.ac.uk  Wed Dec 12 11:21:51 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 12 Dec 2007 11:21:51 +0000
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
Message-ID: <475FC44F.7030700@ebi.ac.uk>

Michael Thon wrote:
> I am setting up a database from Genbank formatted files.  I understand 
> how to index the db and configure the emboss.default file but I don't 
> know how to construct the queries.  queries for sequence IDs are pretty 
> simple, i.e. with a USA of the format "dbname:id".  But, how to I create 
> a query for the other fields, such as org and key?  Also, do these 
> fields support wildcards or substring matches or other fancy stuff?

Assuming you indexed all the fields (by default ID and ACC are indexed)
you use the same syntax as in srs (we saw no need to invent a new
syntax, so we used the same field name abbreviations but we did drop the
'[]' around the query :-)

dbname-acc:x13776
dbname-org:pseudomonas*
dbname-des:amidase
dbname-key:
dbname-sv:
dbname-gi:

and, to complete the set, dbname-id:x13776

As you see, wildcards are allowed with '*' at the end.

We can make this much more sophisticated, allowing more wildcard options
and combining queries. So far EMBOSS users have been content to use SRS
or alternatives (MRS for example).

If there is interest, we can extend the USA to include wildcards,
AND/OR/NOT, search multiple fields, combine databases, and if we get
really ambitious we could include links between databases.

We will have to be careful to restrict some of these extensions to
database access methods that support them.

Hope this helps,

Peter
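
For readers who want to see the shape of these queries spelled out, here is
a small Python sketch that splits a "dbname-field:term" string into its parts
and applies the trailing '*' wildcard. It is only an illustration: the
function names are invented, defaulting a missing field to "id" is an
assumption of the sketch, and the real parsing is done inside the EMBOSS
libraries.

```python
def parse_query(usa):
    """Split 'dbname-field:term' (or 'dbname:term') into three parts.

    Assumption of this sketch: with no '-field' part the field is 'id'.
    """
    left, sep, term = usa.partition(":")
    if not sep:
        raise ValueError("expected 'dbname:term' or 'dbname-field:term'")
    dbname, _, field = left.partition("-")
    return dbname, field or "id", term


def term_matches(term, value):
    """Case-insensitive match with an optional '*' at the end of the term."""
    term, value = term.lower(), value.lower()
    if term.endswith("*"):
        return value.startswith(term[:-1])
    return value == term


if __name__ == "__main__":
    print(parse_query("dbname-org:pseudomonas*"))
    print(term_matches("pseudomonas*", "Pseudomonas aeruginosa"))
```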


From mike.thon at gmail.com  Wed Dec 12 16:12:05 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 12 Dec 2007 17:12:05 +0100
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <475FC44F.7030700@ebi.ac.uk>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
	<475FC44F.7030700@ebi.ac.uk>
Message-ID: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>

Thanks Peter, I got it working.
While I'm at it, a couple more questions popped up:
1) Do you know if these indexes are compatible with the Bio::DB::Registry
type databases?
2) Is there any way to index and search sequence features?
Best
Mike


On Dec 12, 2007, at 12:21 PM, Peter Rice wrote:

> Michael Thon wrote:
>> I am setting up a database from Genbank formatted files.  I  
>> understand how to index the db and configure the emboss.default  
>> file but I don't know how to construct the queries.  queries for  
>> sequence IDs are pretty simple, i.e. with a USA of the format  
>> "dbname:id".  But, how to I create a query for the other fields,  
>> such as org and key?  Also, do these fields support wildcards or  
>> substring matches or other fancy stuff?
>
> Assuming you indexed all the fields (by default ID and ACC are  
> indexed)
> you use the same syntax as in srs (we saw no need to invent a new
> syntax, so we used the same field name abbreviations but we did drop  
> the
> '[]' around the query :-)
>
> dbname-acc:x13776
> dbname-org:pseudomonas*
> dbname-des:amidase
> dbname-key:
> dbname-sv:
> dbname-gi:
>
> and, to complete the set, dbname-id:x13776
>
> As you see, wildcards are allowed with '*' at the end.
>
> We can make this much more sophisticated, allowing more wildcard  
> options
> and combining queries. So far EMBOSS users have been content to use  
> SRS
> or alternatives (MRS for example).
>
> If there is interest, we can extend the USA to include wildcards,
> AND/OR/NOT, search multiple fields, combine databases, and if we get
> really ambitious we could include links between databases.
>
> We will have to be careful to restrict some of these extensions to
> database access methods that support them.
>
> Hope this helps,
>
> Peter


From pmr at ebi.ac.uk  Wed Dec 12 16:20:31 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 12 Dec 2007 16:20:31 +0000
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
	<475FC44F.7030700@ebi.ac.uk>
	<4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
Message-ID: <47600A4F.8030107@ebi.ac.uk>

Michael Thon wrote:
> Thanks Peter, I got it working.
> While I'm at it, a couple more questions popped up:
> 1) do you know if  these indexes compatible with the Bio::DB::Registry 
> type databases?

No ... well, we could add Bio::DB indices to the things EMBOSS can 
retrieve, then they would be :-)

> 2) Is there any way to index and search sequence features?

Not at present - but:

2a. what would you like to search for ...
2b. what would you like as the result ...
   2b.i. if you want features, what do we call them?

regards,

Peter


From mike.thon at gmail.com  Fri Dec 14 17:28:59 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Fri, 14 Dec 2007 18:28:59 +0100
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <47600A4F.8030107@ebi.ac.uk>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
	<475FC44F.7030700@ebi.ac.uk>
	<4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
	<47600A4F.8030107@ebi.ac.uk>
Message-ID: <2F72B57C-A8C4-4A8E-84B6-5793764DBDD4@gmail.com>


On Dec 12, 2007, at 5:20 PM, Peter Rice wrote:

> Michael Thon wrote:
>> Thanks Peter, I got it working.
>> While I'm at it, a couple more questions popped up:
>> 1) do you know if  these indexes compatible with the  
>> Bio::DB::Registry type databases?
>
> No ... well, we could add Bio::DB indices to the things EMBOSS can  
> retrieve, then they would be :-)
>
>> 2) Is there any way to index and search sequence features?
>
> Not at present - but:
>
> 2a. what would you like to search for ...
> 2b. what would you like as the result ...
>  2b.i. if you want features, what do we call them?
>
Actually, I haven't given it much thought.  But, for starters, one
might want to retrieve proteins containing domain X, or that are
annotated with InterPro term Y.  Perhaps some of this functionality
could be accomplished through clever use of the key or des fields, i.e.
by putting all the InterPro terms assigned to a protein in the keyword
field prior to indexing.

One might also want to query a database of genomic DNA and fetch a  
translation of a gene or its spliced CDS.
best
Mike


From bernd.web at gmail.com  Mon Dec 17 20:01:32 2007
From: bernd.web at gmail.com (Bernd Web)
Date: Mon, 17 Dec 2007 21:01:32 +0100
Subject: [EMBOSS] iep/gifasta
Message-ID: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>

Hi,

I'd like to run iep on a sequence and use either pir or osformat gifasta.
The following gives an error (using emboss 5.0.0 on Debian):

iep -filter -osformat gifasta -sequence seq.txt
This returns "Died: Unknown qualifier -osformat"

iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
also give an error:
"Died: iep terminated: Bad value for '-sequence' with -auto defined"
(with or without the sequence flag)

However, iep -sformat fasta seq.txt works. What am I doing wrong?

I'd like output to contain the accession number. I thought -osformat
gifasta was for this purpose.
My FastA definition line is e.g.
>ENSG00000205090|1|protein_coding.
The iep report would be more useful if it contained the ENSG number
instead of "protein_coding" or the entire definition line.

How to do this?


Kind regards,
Bernd


From pmr at ebi.ac.uk  Tue Dec 18 09:23:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 18 Dec 2007 09:23:18 +0000
Subject: [EMBOSS] iep/gifasta
In-Reply-To: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
References: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
Message-ID: <47679186.6020003@ebi.ac.uk>

Hi Bernd,

Bernd Web wrote:
> Hi,
> 
> I'd like to run iep on a sequence and use either pir or osformat gifasta.
> The following gives an error (using emboss 5.0.0 on Debian):
> 
> iep -filter -osformat gifasta -sequence seq.txt
> This returns "Died: Unknown qualifier -osformat"

-osformat is for sequence outputs (and iep has no sequence outputs).

iep writes a plain text file as output and has no special options,
but we will add more information (accession and description) in a 
future release ... and to other plain text output files too.

> iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
> also give an error:
> "Died: iep terminated: Bad value for '-sequence' with -auto defined"
> (with or without the sequence flag)
> 
> However, iep -sformat fasta seq.txt works. What am I doing wrong?

It appears your sequence can be read in fasta format but not in pir 
format. PIR format has special characters after the first '>'

> My FastA definition line is e.g.
>> ENSG00000205090|1|protein_coding.
> The IEP report would me more useful if it contains the ENSG number
> instead of "protein coding or the entire definition line.

Not a nice format. NCBI made up a lot of FASTA file identifiers with '|' 
characters and we try to follow their rules. That causes us to ignore 
the first part (it should be a database name) and read the ID from the end.

You could reformat the FASTA files (e.g. with a perl script) to remove 
the '|' characters and leave something useful as the plain ID (perhaps 
ENSG00000205090_1 in this case) and the rest as description.

Hope that helps,

Peter Rice
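
The reformatting suggested above could look roughly like this. The sketch is
in Python rather than perl, the function names are invented, and joining the
first two fields with '_' simply follows the ENSG00000205090_1 example; adjust
it to whatever ID you actually want.

```python
def reformat_header(line):
    """Turn '>A|B|rest' into '>A_B rest'; leave other lines untouched."""
    if not line.startswith(">") or "|" not in line:
        return line
    fields = line[1:].rstrip("\n").split("|")
    new_id = "_".join(fields[:2])        # e.g. ENSG00000205090_1
    description = " ".join(fields[2:])   # everything else becomes description
    return (">" + new_id + " " + description).rstrip()


def reformat_fasta(lines):
    """Apply reformat_header to every line of a FASTA file."""
    return [reformat_header(line.rstrip("\n")) for line in lines]


if __name__ == "__main__":
    print(reformat_header(">ENSG00000205090|1|protein_coding"))
```

Sequence lines pass through unchanged, so the result can be written straight
back out and read by iep with -sformat fasta.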


From peter.robinson at t-online.de  Thu Dec 20 15:08:59 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 16:08:59 +0100
Subject: [EMBOSS] Seqall Datatype
Message-ID: <476A858B.9080403@t-online.de>

Dear EMBOSSERs,

I am trying my hand at an EMBOSS program and would like to read in a 
list of sequences from a FASTA file and make pairwise comparisons 
between all the sequences.  If I start with an AjPSeqall object

AjPSeqall seqs=NULL;
seqs = ajAcdGetSeqall ("seqs");

I have seen

AjPSeq seq = NULL;   /* filled in by each call to ajSeqallNext */

while(ajSeqallNext(seqs, &seq)) {
    /* process one sequence at a time here */
}

in the documentation, but I would like to do something like a double for 
loop to get all pairwise comparisons. What is the best way of doing 
this? I have been searching in the online docs but did not yet find 
anything.

By the way, in http://emboss.sourceforge.net/developers/program.html

*17.2 Getting information from a sequence*

*ajSeqGetName* get the name. This is a pointer to the internal AjPStr

*ajSeqName* get the name. This is a pointer to the internal char*

these datatypes are flagged as obsolete by the compiler, so the document 
may need revision here?

Thanks,
Peter Robinson


From pmr at ebi.ac.uk  Thu Dec 20 16:07:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 20 Dec 2007 16:07:18 +0000
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A858B.9080403@t-online.de>
References: <476A858B.9080403@t-online.de>
Message-ID: <476A9336.9090504@ebi.ac.uk>

Dear Peter,

> I am trying my hand at an EMBOSS program and would like to read in a 
> list of sequences from a FASTA file and make pairwise comparisons 
> between each sequence.  If I startwith a AjPSeqall object
> 
> AjPSeqall seqs=NULL;
> seqs = ajAcdGetSeqall ("seqs");

You want all the sequences in memory so you can work through the pairs - 
better to use AjPSeqset and ajAcdGetSeqset

> in the documentation, but I would like to do something like a double for 
> loop to get all pairwise comparisons. What is the best way of doing 
> this? I have been searching in the online docs but did not yet find 
> anything.

distmat has the kind of loop you are looking for (except it does self 
matches too)


> By the way, in http://emboss.sourceforge.net/developers/program.html
> 
> these datatypes are flagged as obsolete by the compiler, so the document 
> may need revision here?

Yes, all being revised for the books we are preparing ... we will take a 
look through program.html and make some basic updates to correct these 
things.

regards,

Peter
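
The loop shape in question, sketched in Python with plain strings standing in
for a sequence set held in memory. This is not the distmat code, and the
function names and the toy comparison are made up for the example; in an
EMBOSS program the list would be the AjPSeqset and the indices would select
sequences from it.

```python
def pairwise_indices(n, include_self=False):
    """Yield (i, j) index pairs with i < j, or i <= j if include_self."""
    for i in range(n):
        start = i if include_self else i + 1
        for j in range(start, n):
            yield i, j


def count_identical_positions(a, b):
    """Toy comparison: number of positions at which two strings agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


seqs = ["ACGT", "ACGA", "TCGA"]  # stand-in for the sequences in memory
scores = {(i, j): count_identical_positions(seqs[i], seqs[j])
          for i, j in pairwise_indices(len(seqs))}

if __name__ == "__main__":
    for pair, score in scores.items():
        print(pair, score)
```

Starting the inner loop at i + 1 skips the self matches; starting it at i
includes them.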


From peter.robinson at t-online.de  Thu Dec 20 17:00:38 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 18:00:38 +0100
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A9336.9090504@ebi.ac.uk>
References: <476A858B.9080403@t-online.de> <476A9336.9090504@ebi.ac.uk>
Message-ID: <476A9FB6.6080407@t-online.de>

Peter Rice wrote:
> Dear Peter,
>
>> I am trying my hand at an EMBOSS program and would like to read in a 
>> list of sequences from a FASTA file and make pairwise comparisons 
>> between each sequence.  If I startwith a AjPSeqall object
>>
>> AjPSeqall seqs=NULL;
>> seqs = ajAcdGetSeqall ("seqs");
>
> You want all the sequences in memory so you can work through the pairs 
> - better to use AjPSeqset and ajAcdGetSeqset
>
>> in the documentation, but I would like to do something like a double 
>> for loop to get all pairwise comparisons. What is the best way of 
>> doing this? I have been searching in the online docs but did not yet 
>> find anything.
>
> distmat has the kind of loop you are looking for (except it does self 
> matches too)
>
>
>> By the way, in http://emboss.sourceforge.net/developers/program.html
>>
>> these datatypes are flagged as obsolete by the compiler, so the 
>> document may need revision here?
>
> Yes, all being revised for the books we are preparing ... we will take 
> a look through program.html and make some basic updates to correct 
> these things.
>
> regards,
>
> Peter
>
Dear Peter,
thanks for the tip, that was just what I needed!
best wishes for the holidays!
Peter


From staffa at niehs.nih.gov  Thu Dec 20 21:44:02 2007
From: staffa at niehs.nih.gov (Staffa, Nick (NIH/NIEHS))
Date: Thu, 20 Dec 2007 16:44:02 -0500
Subject: [EMBOSS] newcpgreport
Message-ID: <C3904C52.77F8%staffa@niehs.nih.gov>

I have been using EMBOSS newcpgreport by
Rodrigo Lopez (rls at ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK

http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
says:
By default, this program defines a CpG island as a region where, over an
average of 10 windows, the calculated % composition is over 50% and the
calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
of 200 bases. These conditions can be modified by setting the values of the
appropriate parameters.

I may be very dull and unimaginative, but I'd sure like a more detailed
explanation of what the program is doing to define a CpG island.
Does anyone know where this might be found?
Or even the code.

Can anyone help please.

Thanks
 
Nick Staffa 
Telephone: 919-316-4569  (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov)
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina


From pmr at ebi.ac.uk  Fri Dec 21 09:09:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 21 Dec 2007 09:09:18 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To: <C3904C52.77F8%staffa@niehs.nih.gov>
References: <C3904C52.77F8%staffa@niehs.nih.gov>
Message-ID: <476B82BE.7000602@ebi.ac.uk>

Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls at ebi.ac.uk)
> 
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.

The code is included in EMBOSS (as emboss/newcpgreport.c)

The original reference for the CpG island criteria is in the paper 
listed in the "references" section of the newcpgreport documentation.

Larsen F., Gundersen G., Lopez R., Prydz H. "CpG islands as Gene Markers 
in the Human Genome" Genomics 13:1095-1107 (1992)

If memory serves, this refers to earlier work by Gardiner-Garden.

If you need more information I am just along the corridor from Rodrigo's 
office ... once we're both back after Xmas :-)

Hope that helps,

Peter
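
In the meantime, the two quantities named in the documentation can be
computed for a single window as in the Python sketch below. This is not taken
from newcpgreport.c: the function names are invented, the Obs/Exp formula is
the usual one from Gardiner-Garden and Frommer, and the averaging over 10
windows and the 200 base minimum length are left out.

```python
def cpg_window_stats(window):
    """Return (%C+G, observed/expected CpG) for one window of DNA.

    Obs/Exp = (number of CpG * window length) / (number of C * number of G)
    """
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = sum(1 for i in range(n - 1) if window[i:i + 2] == "CG")
    percent_gc = 100.0 * (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return percent_gc, obs_exp


def passes_default_thresholds(window):
    """Apply the documented defaults (%C+G > 50, Obs/Exp > 0.6) to one window."""
    percent_gc, obs_exp = cpg_window_stats(window)
    return percent_gc > 50.0 and obs_exp > 0.6


if __name__ == "__main__":
    print(cpg_window_stats("CGCGCGCG"))
    print(passes_default_thresholds("ATATATAT"))
```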


From rls at ebi.ac.uk  Fri Dec 21 09:11:05 2007
From: rls at ebi.ac.uk (Rodrigo Lopez)
Date: Fri, 21 Dec 2007 09:11:05 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To: <C3904C52.77F8%staffa@niehs.nih.gov>
References: <C3904C52.77F8%staffa@niehs.nih.gov>
Message-ID: <476B8329.9060007@ebi.ac.uk>

Hi,

The relevant papers describing the method in detail are:

PubMed:3656447
Gardiner-Garden M., Frommer M.
CpG islands in vertebrate genomes.
(20-Jul-1987) Journal of molecular biology, 196 (2) :261-82

PubMed:1505946
Larsen F., Gundersen G., Lopez R., Prydz H.
CpG islands as gene markers in the human genome.
(Aug-1992) Genomics, 13 (4) :1095-107

The source code - currently maintained by the EMBOSS team - is in the 
EMBOSS distribution.  See your <yourdir>/EMBOSS-5.0.0/emboss/newcpgreport.c

Hope this helps. Please do not hesitate to contact me if you have 
further queries.

R:)


Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls at ebi.ac.uk)
> European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
> Cambridge CB10 1SD, UK
> 
> http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
> says:
> By default, this program defines a CpG island as a region where, over an
> average of 10 windows, the calculated % composition is over 50% and the
> calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
> of 200 bases. These conditions can be modified by setting the values of the
> appropriate parameters.
> 
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.
> 
> Can anyone help please.
> 
> Thanks
>  
> Nick Staffa 
> Telephone: 919-316-4569  (NIEHS: 6-4569)
> Scientific Computing Support Group
> NIEHS Information Technology Support Services Contract
> (Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov)
> National Institute of Environmental Health Sciences
> National Institutes of Health
> Research Triangle Park, North Carolina
> 
> 


From gbottu at vub.ac.be  Wed Dec 26 14:46:40 2007
From: gbottu at vub.ac.be (Guy Bottu)
Date: Wed, 26 Dec 2007 15:46:40 +0100
Subject: [EMBOSS] extractalign
Message-ID: <47726950.6070007@vub.ac.be>

	Dear all,

I just noticed that EMBOSS version 5 contains a program extractalign, which
extracts ranges from a multiple sequence alignment. This is certainly an
interesting tool. The program is however not accompanied by an on-line manual
and it is not mentioned in the Changelog. Any comment from the developers?

	Happy Christmas to you all,
	Guy Bottu,
	BEN


From david at compbio.dundee.ac.uk  Thu Dec 27 11:31:41 2007
From: david at compbio.dundee.ac.uk (David Martin)
Date: Thu, 27 Dec 2007 11:31:41 +0000
Subject: [EMBOSS] Identifying sequence formats.
Message-ID: <C3993D9D.2D2E0%david@compbio.dundee.ac.uk>

Is there an easy way of identifying the format of a sequence using EMBOSS?
It does wonderful autodetect but I'd like to be able to find out what it
thinks the sequence format is for an arbitrary sequence.

regards

..d


-- 
David Martin PhD
Post-Genomics and Molecular Interactions Centre
University of Dundee
http://www.compbio.dundee.ac.uk/
 

From pmr at ebi.ac.uk  Fri Dec 28 10:20:27 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 28 Dec 2007 10:20:27 +0000
Subject: [EMBOSS] Identifying sequence formats.
In-Reply-To: <C3993D9D.2D2E0%david@compbio.dundee.ac.uk>
References: <C3993D9D.2D2E0%david@compbio.dundee.ac.uk>
Message-ID: <4774CDEB.4040407@ebi.ac.uk>

David Martin wrote:
> Is there an easy way of identifying the format of a sequence using EMBOSS?
> It does wonderful autodetect but I'd like to be able to find out what it
> thinks the sequence format is for an arbitrary sequence.

The information is stored so you can craft a little application to print 
out the value of the FormatStr attribute.

There may be some oddities .... it automatically switches between 
EMBL/SwissProt and FASTA/NCBI formats depending on the first line. Let 
us know and we can look to apply corrections.
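
To show what "depending on the first line" can mean in practice, here
is a small Python sketch of first-line sniffing. It is only a heuristic
written for illustration, not the detection logic inside EMBOSS, and
the function name guess_format is hypothetical. The returned labels
follow the usual EMBOSS format names, but treat them as plain labels.

```python
def guess_format(first_line: str) -> str:
    """Guess a sequence format label from the first line of an entry."""
    line = first_line.rstrip("\n")
    if line.startswith(">"):
        # NCBI-style headers pack the identifier into '|'-separated fields.
        first_token = line.split(None, 1)[0]
        return "ncbi" if "|" in first_token else "fasta"
    if line.startswith("ID   "):
        # Both EMBL and SwissProt entries start with an ID line; the
        # length unit at the end (BP vs AA) tells them apart.
        if line.endswith("AA."):
            return "swiss"
        if line.endswith("BP."):
            return "embl"
        return "unknown"
    if line.startswith("LOCUS"):
        return "genbank"
    return "unknown"
```

Because EMBL/SwissProt and FASTA/NCBI each share the same leading
characters, a check like this has to look further along the line, which
is where the oddities mentioned above are most likely to show up.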

Season's greetings and all the best for the New Year

Peter


From pmr at ebi.ac.uk  Fri Dec 28 10:37:07 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 28 Dec 2007 10:37:07 +0000
Subject: [EMBOSS] extractalign
In-Reply-To: <47726950.6070007@vub.ac.be>
References: <47726950.6070007@vub.ac.be>
Message-ID: <4774D1D3.4050503@ebi.ac.uk>

Guy Bottu wrote:
>     Dear all,
> 
> I just noticed that EMBOSS version 5 contains a program extractalign, which
> extracts ranges from a multiple sequence alignment. This is certainly an
> interesting tool. The program is however not accompanied by an on-line manual
> and it is not mentioned in the Changelog. Any comment from the developers?

Well ... it is accompanied by an online manual .... just not included in 
the programs index.

edialign and wordfinder were also missing.

Now to update the ChangeLog (wordfinder is missing there too)...

Season's greetings and Happy New Year

Peter