From aengus.stewart at cancer.org.uk Wed Dec 5 08:19:05 2007
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 05 Dec 2007 13:19:05 +0000
Subject: [EMBOSS] restrict -limit
Message-ID: <4756A549.1030303@cancer.org.uk>

I seem to be having trouble with restrict not picking up -limit, or am I
not using it correctly?

I shouldn't be getting both BssKI and ScrFI, should I, or indeed both BseBI
and EcoRII?

########################################
# Program: restrict
# Rundate: Wed 5 Dec 2007 13:17:08
# Commandline: restrict
#    -sitelen 4
#    -enzymes all
#    -limit
#    -blunt
#    -single
#    [-sequence] rs9584819.ff
#    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
# Report_format: table
# Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
########################################

#=======================================
#
# Sequence: rs9584819     from: 1   to: 101
# HitCount: 24
#
# Minimum cuts per enzyme: 1
# Maximum cuts per enzyme: 1
# Minimum length of recognition site: 4
# Blunt ends allowed
# Sticky ends allowed
# DNA is linear
# Ambiguities allowed
#
#=======================================

  Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
     13      17 BssKI       CCNGG                12     17         .         .
     13      17 BseBI       CCWGG                14     15         .         .
     13      17 ScrFI       CCNGG                14     15         .         .
     13      17 EcoRII      CCWGG                12     17         .         .

Regards
Aengus

--
-----------------------------------------------------------------------
Aengus Stewart
Head of Bioinformatics and BioStatistics
Bioinformatics and BioStatistics        Tel: +44 (0)20 7269 3679
Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK
-----------------------------------------------------------------------

This electronic message contains information which may be privileged and
confidential. The information is intended to be for the use of the
individual(s) or entity named above. Be aware that any third party
disclosure, distribution, copying or use of this communication, without
prior permission, is strictly prohibited.

From aengus.stewart at cancer.org.uk Wed Dec 5 10:00:50 2007
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Wed, 05 Dec 2007 15:00:50 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <4756BD22.6070902@cancer.org.uk>

Yeah I know, not one of my brightest days...

It helps to look at the cut position as well as the motif *sigh*

Aengus

Aengus Stewart wrote:
> I seem to be having trouble with restrict not picking up -limit, or am I
> not using it correctly?
>
> I shouldn't be getting both BssKI and ScrFI, should I, or indeed both BseBI
> and EcoRII?
>
> ########################################
> # Program: restrict
> # Rundate: Wed 5 Dec 2007 13:17:08
> # Commandline: restrict
> #    -sitelen 4
> #    -enzymes all
> #    -limit
> #    -blunt
> #    -single
> #    [-sequence] rs9584819.ff
> #    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> # Report_format: table
> # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> ########################################
>
> #=======================================
> #
> # Sequence: rs9584819     from: 1   to: 101
> # HitCount: 24
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 1
> # Minimum length of recognition site: 4
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
>
>   Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
>      13      17 BssKI       CCNGG                12     17         .         .
>      13      17 BseBI       CCWGG                14     15         .         .
>      13      17 ScrFI       CCNGG                14     15         .         .
>      13      17 EcoRII      CCWGG                12     17         .         .
>
> Regards
> Aengus

From ajb at ebi.ac.uk Wed Dec 5 10:08:26 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 5 Dec 2007 15:08:26 -0000 (GMT)
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <56936.81.98.241.17.1196867306.squirrel@webmail.ebi.ac.uk>

Hello Aengus,

Restrict will report enzymes with the same recognition site if the source
REBASE database lists them as having different cut sites. That appears to
be the case with your reported output. So, you do seem to be using it
correctly and the results also seem to be correct.

Alan

> I seem to be having trouble with restrict not picking up -limit, or am I
> not using it correctly?
>
> I shouldn't be getting both BssKI and ScrFI, should I, or indeed both BseBI
> and EcoRII?
>
> ########################################
> # Program: restrict
> # Rundate: Wed 5 Dec 2007 13:17:08
> # Commandline: restrict
> #    -sitelen 4
> #    -enzymes all
> #    -limit
> #    -blunt
> #    -single
> #    [-sequence] rs9584819.ff
> #    -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> # Report_format: table
> # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict
> ########################################
>
> #=======================================
> #
> # Sequence: rs9584819     from: 1   to: 101
> # HitCount: 24
> #
> # Minimum cuts per enzyme: 1
> # Maximum cuts per enzyme: 1
> # Minimum length of recognition site: 4
> # Blunt ends allowed
> # Sticky ends allowed
> # DNA is linear
> # Ambiguities allowed
> #
> #=======================================
>
>   Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
>      13      17 BssKI       CCNGG                12     17         .         .
>      13      17 BseBI       CCWGG                14     15         .         .
>      13      17 ScrFI       CCNGG                14     15         .         .
>      13      17 EcoRII      CCWGG                12     17         .         .
>
> Regards
> Aengus
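Alan's point in this thread, that enzymes sharing a recognition site are still listed separately when their cut positions differ, can be checked against the four hits in the report quoted above. The following is only a minimal Python sketch of that reasoning, not how restrict itself is implemented; the `Hit` tuple and the `group_by_site_and_cut` helper are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of the restrict table (only the columns used here)."""
    enzyme: str
    site: str
    cut5: int  # value from the 5prime column
    cut3: int  # value from the 3prime column


# The four hits at positions 13-17 taken from the report in this thread.
HITS = [
    Hit("BssKI", "CCNGG", 12, 17),
    Hit("BseBI", "CCWGG", 14, 15),
    Hit("ScrFI", "CCNGG", 14, 15),
    Hit("EcoRII", "CCWGG", 12, 17),
]


def group_by_site_and_cut(hits):
    """Group enzymes that share both recognition site and cut positions."""
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit.site, hit.cut5, hit.cut3)].append(hit.enzyme)
    return dict(groups)


groups = group_by_site_and_cut(HITS)
for key, enzymes in groups.items():
    print(key, enzymes)
```

Each of the four enzymes ends up in its own group, because no two of them agree on both the site and the cut positions, which is consistent with all four appearing in the output.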
From gbottu at vub.ac.be Wed Dec 5 10:30:25 2007
From: gbottu at vub.ac.be (Guy Bottu)
Date: Wed, 05 Dec 2007 16:30:25 +0100
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756A549.1030303@cancer.org.uk>
References: <4756A549.1030303@cancer.org.uk>
Message-ID: <4756C411.8090101@vub.ac.be>

Aengus Stewart wrote:
> I seem to be having trouble with restrict not picking up -limit or am I
> not using it correctly?

By default restrict searches only for prototype enzymes; if you want to see
all enzymes you must explicitly set -nolimit.

However, I notice that at our site too the file
.../share/EMBOSS/data/embossre.equ does not contain entries for BssKI and
BseBI, while it should. Maybe there is a bug in the program rebaseextract,
or some subtle typo in the files from REBASE. Could the EMBOSS team figure
it out?

Guy Bottu, Belgian EMBnet Node

From pmr at ebi.ac.uk Wed Dec 5 10:57:59 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Dec 2007 15:57:59 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756C411.8090101@vub.ac.be>
References: <4756A549.1030303@cancer.org.uk> <4756C411.8090101@vub.ac.be>
Message-ID: <4756CA87.2080206@ebi.ac.uk>

Guy Bottu wrote:
> I however notice that also at our site
> the file .../share/EMBOSS/data/embossre.equ does not contain entries for
> BssKI and BseBI, while it should. Maybe there is a bug in the program
> rebaseextract or some subtle typo in the files from the Rebase. Could the
> EMBOSS team figure it out ?

Which version of REBASE did you use for rebaseextract?
Peter

From sum732 at mail.usask.ca Fri Dec 7 18:01:43 2007
From: sum732 at mail.usask.ca (Sudeep Mehrotra)
Date: Fri, 07 Dec 2007 17:01:43 -0600
Subject: [EMBOSS] Emboss-Digest
Message-ID: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>

Hello,

I used "digest" from EMBOSS to digest a protein database obtained from NCBI
REFSEQ. Here is how I executed digest:

    digest -seqall "DB_NAME" -aadata "File_name" -outfile "File_Name"

From the list I selected trypsin.

For some reason, digest skipped this particular protein (no fragments were
generated):

>gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis stolonifera]
MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR

Any ideas?

I should get two fragments. I don't want to see the partial digests, which
is why I never selected that option.

Thanks
Sudeep

From ajb at ebi.ac.uk Fri Dec 7 20:13:31 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Sat, 8 Dec 2007 01:13:31 -0000 (GMT)
Subject: [EMBOSS] Emboss-Digest
In-Reply-To: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
References: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
Message-ID: <34776.81.98.241.17.1197076411.squirrel@webmail.ebi.ac.uk>

Dear Sudeep,

Trypsin doesn't cut as well if (e.g.) the K is followed by any of "KRIFLP"
(Prof. D. Pappin, personal comm). Your sequence contains "...KL..." so
there is no cut.

If you want unfavoured cuts to be shown (e.g. a cut after every K for
trypsin) then add the flag "-unfavoured" to the command line.

HTH

Alan

> Hello,
> I used "digest" from EMBOSS to digest a protein database obtained from
> NCBI REFSEQ.
> Here is how I executed digest:
> digest -seqall "DB_NAME" -aadata "File_name" -outfile "File_Name"
> From the list I selected trypsin.
> For some reason, digest skipped this particular protein (no fragments
> were generated):
> >gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis
> stolonifera]
> MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR
>
> Any ideas?
>
> I should get two fragments. I don't want to see the partial digests,
> which is why I never selected that option.
>
> Thanks
> Sudeep

From mike.thon at gmail.com Wed Dec 12 05:24:34 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 12 Dec 2007 11:24:34 +0100
Subject: [EMBOSS] EMBOSS database queries
Message-ID: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>

I am setting up a database from GenBank formatted files. I understand how
to index the db and configure the emboss.default file but I don't know how
to construct the queries. Queries for sequence IDs are pretty simple, i.e.
with a USA of the format "dbname:id". But how do I create a query for the
other fields, such as org and key? Also, do these fields support wildcards
or substring matches or other fancy stuff?

cheers
Mike

From pmr at ebi.ac.uk Wed Dec 12 06:21:51 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 12 Dec 2007 11:21:51 +0000
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com>
Message-ID: <475FC44F.7030700@ebi.ac.uk>

Michael Thon wrote:
> I am setting up a database from GenBank formatted files. I understand
> how to index the db and configure the emboss.default file but I don't
> know how to construct the queries. Queries for sequence IDs are pretty
> simple, i.e. with a USA of the format "dbname:id". But how do I create
> a query for the other fields, such as org and key? Also, do these
> fields support wildcards or substring matches or other fancy stuff?
Assuming you indexed all the fields (by default ID and ACC are indexed),
you use the same syntax as in SRS (we saw no need to invent a new syntax,
so we used the same field name abbreviations, but we did drop the '[]'
around the query :-)

dbname-acc:x13776
dbname-org:pseudomonas*
dbname-des:amidase
dbname-key:
dbname-sv:
dbname-gi:

and, to complete the set, dbname-id:x13776

As you see, wildcards are allowed with '*' at the end.

We can make this much more sophisticated, allowing more wildcard options
and combining queries. So far EMBOSS users have been content to use SRS or
alternatives (MRS for example).

If there is interest, we can extend the USA to include wildcards,
AND/OR/NOT, search multiple fields, combine databases, and if we get
really ambitious we could include links between databases.

We will have to be careful to restrict some of these extensions to
database access methods that support them.

Hope this helps,

Peter

From mike.thon at gmail.com Wed Dec 12 11:12:05 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 12 Dec 2007 17:12:05 +0100
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <475FC44F.7030700@ebi.ac.uk>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> <475FC44F.7030700@ebi.ac.uk>
Message-ID: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>

Thanks Peter, I got it working.
While I'm at it, a couple more questions popped up:
1) Do you know if these indexes are compatible with the Bio::DB::Registry
type databases?
2) Is there any way to index and search sequence features?

Best
Mike

On Dec 12, 2007, at 12:21 PM, Peter Rice wrote:

> Michael Thon wrote:
>> I am setting up a database from GenBank formatted files. I
>> understand how to index the db and configure the emboss.default
>> file but I don't know how to construct the queries. Queries for
>> sequence IDs are pretty simple, i.e. with a USA of the format
>> "dbname:id". But how do I create a query for the other fields,
>> such as org and key? Also, do these fields support wildcards or
>> substring matches or other fancy stuff?
>
> Assuming you indexed all the fields (by default ID and ACC are
> indexed), you use the same syntax as in SRS (we saw no need to invent
> a new syntax, so we used the same field name abbreviations, but we did
> drop the '[]' around the query :-)
>
> dbname-acc:x13776
> dbname-org:pseudomonas*
> dbname-des:amidase
> dbname-key:
> dbname-sv:
> dbname-gi:
>
> and, to complete the set, dbname-id:x13776
>
> As you see, wildcards are allowed with '*' at the end.
>
> We can make this much more sophisticated, allowing more wildcard
> options and combining queries. So far EMBOSS users have been content
> to use SRS or alternatives (MRS for example).
>
> If there is interest, we can extend the USA to include wildcards,
> AND/OR/NOT, search multiple fields, combine databases, and if we get
> really ambitious we could include links between databases.
>
> We will have to be careful to restrict some of these extensions to
> database access methods that support them.
>
> Hope this helps,
>
> Peter

From pmr at ebi.ac.uk Wed Dec 12 11:20:31 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 12 Dec 2007 16:20:31 +0000
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> <475FC44F.7030700@ebi.ac.uk> <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com>
Message-ID: <47600A4F.8030107@ebi.ac.uk>

Michael Thon wrote:
> Thanks Peter, I got it working.
> While I'm at it, a couple more questions popped up:
> 1) Do you know if these indexes are compatible with the
> Bio::DB::Registry type databases?

No ... well, we could add Bio::DB indices to the things EMBOSS can
retrieve, then they would be :-)

> 2) Is there any way to index and search sequence features?

Not at present - but:

2a. what would you like to search for ...
2b. what would you like as the result ...
2b.i. if you want features, what do we call them?

regards,

Peter

From mike.thon at gmail.com Fri Dec 14 12:28:59 2007
From: mike.thon at gmail.com (Michael Thon)
Date: Fri, 14 Dec 2007 18:28:59 +0100
Subject: [EMBOSS] EMBOSS database queries
In-Reply-To: <47600A4F.8030107@ebi.ac.uk>
References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> <475FC44F.7030700@ebi.ac.uk> <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com> <47600A4F.8030107@ebi.ac.uk>
Message-ID: <2F72B57C-A8C4-4A8E-84B6-5793764DBDD4@gmail.com>

On Dec 12, 2007, at 5:20 PM, Peter Rice wrote:

> Michael Thon wrote:
>> Thanks Peter, I got it working.
>> While I'm at it, a couple more questions popped up:
>> 1) Do you know if these indexes are compatible with the
>> Bio::DB::Registry type databases?
>
> No ... well, we could add Bio::DB indices to the things EMBOSS can
> retrieve, then they would be :-)
>
>> 2) Is there any way to index and search sequence features?
>
> Not at present - but:
>
> 2a. what would you like to search for ...
> 2b. what would you like as the result ...
> 2b.i. if you want features, what do we call them?

Actually, I haven't given it much thought. But, for starters, one might
want to retrieve proteins containing domain X, or that are annotated with
InterPro term Y. Perhaps some of this functionality could be accomplished
through clever use of the key or des fields, i.e. by putting all the
InterPro terms assigned to a protein in the keyword field prior to
indexing. One might also want to query a database of genomic DNA and fetch
a translation of a gene or its spliced CDS.

best
Mike

From bernd.web at gmail.com Mon Dec 17 15:01:32 2007
From: bernd.web at gmail.com (Bernd Web)
Date: Mon, 17 Dec 2007 21:01:32 +0100
Subject: [EMBOSS] iep/gifasta
Message-ID: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>

Hi,

I'd like to run iep on a sequence and use either pir or osformat gifasta.
The following gives an error (using EMBOSS 5.0.0 on Debian):

    iep -filter -osformat gifasta -sequence seq.txt

This returns "Died: Unknown qualifier -osformat"

    iep -filter -sformat pir seq.txt

or

    iep -sformat pir -sequence seq.txt

also give an error:
"Died: iep terminated: Bad value for '-sequence' with -auto defined"
(with or without the sequence flag)

However, iep -sformat fasta seq.txt works. What am I doing wrong?
I'd like the output to contain the accession number. I thought -osformat
gifasta was for this purpose.

My FastA definition line is e.g.
>ENSG00000205090|1|protein_coding.
The IEP report would be more useful if it contained the ENSG number
instead of "protein coding" or the entire definition line.
How to do this?

Kind regards,
Bernd

From pmr at ebi.ac.uk Tue Dec 18 04:23:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 18 Dec 2007 09:23:18 +0000
Subject: [EMBOSS] iep/gifasta
In-Reply-To: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
References: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
Message-ID: <47679186.6020003@ebi.ac.uk>

Hi Bernd,

Bernd Web wrote:
> Hi,
>
> I'd like to run iep on a sequence and use either pir or osformat gifasta.
> The following gives an error (using EMBOSS 5.0.0 on Debian):
>
> iep -filter -osformat gifasta -sequence seq.txt
> This returns "Died: Unknown qualifier -osformat"

-osformat is for sequence outputs (and iep has no sequence outputs).

iep writes a plain text file as output and has no special options, but we
will add more information (accession and description) for a future release
... and to other plain text output files too.

> iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
> also give an error:
> "Died: iep terminated: Bad value for '-sequence' with -auto defined"
> (with or without the sequence flag)
>
> However, iep -sformat fasta seq.txt works. What am I doing wrong?

It appears your sequence can be read in fasta format but not in pir format.
PIR format has special characters after the first '>'

> My FastA definition line is e.g.
> >ENSG00000205090|1|protein_coding.
> The IEP report would be more useful if it contained the ENSG number
> instead of "protein coding" or the entire definition line.

Not a nice format. NCBI made up a lot of FASTA file identifiers with '|'
characters and we try to follow their rules. That causes us to ignore the
first part (it should be a database name) and read the ID from the end.

You could reformat the FASTA files (e.g. with a perl script) to remove the
'|' characters and leave something useful as the plain ID (perhaps
ENSG00000205090_1 in this case) and the rest as description.

Hope that helps,

Peter Rice

From peter.robinson at t-online.de Thu Dec 20 10:08:59 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 16:08:59 +0100
Subject: [EMBOSS] Seqall Datatype
Message-ID: <476A858B.9080403@t-online.de>

Dear EMBOSSERs,

I am trying my hand at an EMBOSS program and would like to read in a list
of sequences from a FASTA file and make pairwise comparisons between each
sequence. If I start with an AjPSeqall object

    AjPSeqall seqs=NULL;
    seqs = ajAcdGetSeqall ("seqs");

I have seen

    AjPSeq seq;
    while(ajSeqallNext(seqs, &seq)) {
    }

in the documentation, but I would like to do something like a double for
loop to get all pairwise comparisons. What is the best way of doing this?
I have been searching in the online docs but did not yet find anything.

By the way, in http://emboss.sourceforge.net/developers/program.html

*17.2 Getting information from a sequence*

*ajSeqGetName*
    get the name. This is a pointer to the internal AjPStr
*ajSeqName*
    get the name. This is a pointer to the internal char*

these datatypes are flagged as obsolete by the compiler, so the document
may need revision here?
Thanks,
Peter Robinson

From pmr at ebi.ac.uk Thu Dec 20 11:07:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 20 Dec 2007 16:07:18 +0000
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A858B.9080403@t-online.de>
References: <476A858B.9080403@t-online.de>
Message-ID: <476A9336.9090504@ebi.ac.uk>

Dear Peter,

> I am trying my hand at an EMBOSS program and would like to read in a
> list of sequences from a FASTA file and make pairwise comparisons
> between each sequence. If I start with an AjPSeqall object
>
> AjPSeqall seqs=NULL;
> seqs = ajAcdGetSeqall ("seqs");

You want all the sequences in memory so you can work through the pairs -
better to use AjPSeqset and ajAcdGetSeqset

> in the documentation, but I would like to do something like a double for
> loop to get all pairwise comparisons. What is the best way of doing
> this? I have been searching in the online docs but did not yet find
> anything.

distmat has the kind of loop you are looking for (except it does self
matches too)

> By the way, in http://emboss.sourceforge.net/developers/program.html
>
> these datatypes are flagged as obsolete by the compiler, so the document
> may need revision here?

Yes, all being revised for the books we are preparing ... we will take a
look through program.html and make some basic updates to correct these
things.

regards,

Peter

From peter.robinson at t-online.de Thu Dec 20 12:00:38 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 18:00:38 +0100
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A9336.9090504@ebi.ac.uk>
References: <476A858B.9080403@t-online.de> <476A9336.9090504@ebi.ac.uk>
Message-ID: <476A9FB6.6080407@t-online.de>

Peter Rice wrote:
> Dear Peter,
>
>> I am trying my hand at an EMBOSS program and would like to read in a
>> list of sequences from a FASTA file and make pairwise comparisons
>> between each sequence. If I start with an AjPSeqall object
>>
>> AjPSeqall seqs=NULL;
>> seqs = ajAcdGetSeqall ("seqs");
>
> You want all the sequences in memory so you can work through the pairs
> - better to use AjPSeqset and ajAcdGetSeqset
>
>> in the documentation, but I would like to do something like a double
>> for loop to get all pairwise comparisons. What is the best way of
>> doing this? I have been searching in the online docs but did not yet
>> find anything.
>
> distmat has the kind of loop you are looking for (except it does self
> matches too)
>
>> By the way, in http://emboss.sourceforge.net/developers/program.html
>>
>> these datatypes are flagged as obsolete by the compiler, so the
>> document may need revision here?
>
> Yes, all being revised for the books we are preparing ... we will take
> a look through program.html and make some basic updates to correct
> these things.
>
> regards,
>
> Peter

Dear Peter,

thanks for the tip, that was just what I needed!
Best wishes for the holidays!

Peter

From staffa at niehs.nih.gov Thu Dec 20 16:44:02 2007
From: staffa at niehs.nih.gov (Staffa, Nick (NIH/NIEHS))
Date: Thu, 20 Dec 2007 16:44:02 -0500
Subject: [EMBOSS] newcpgreport
Message-ID:

I have been using EMBOSS newcpgreport by
Rodrigo Lopez (rls at ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK

http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
says:
By default, this program defines a CpG island as a region where, over an
average of 10 windows, the calculated % composition is over 50% and the
calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
of 200 bases. These conditions can be modified by setting the values of the
appropriate parameters.

I may be very dull and unimaginative, but I'd sure like a more detailed
explanation of what the program is doing to define a CpG island.
Does anyone know where this might be found?
Or even the code.

Can anyone help please.
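The default thresholds in the documentation excerpt quoted above can be made concrete for a single window. What follows is a rough Python sketch only, assuming the widely used Obs/Exp definition of Gardiner-Garden and Frommer (CpG count times window length, divided by the product of the C and G counts); it is not the code in newcpgreport.c, it ignores the averaging over 10 windows and the 200-base minimum, and the names `cpg_window_stats`, `passes_thresholds`, `min_pc` and `min_oe` are hypothetical.

```python
def cpg_window_stats(window: str):
    """Return (percent C+G, observed/expected CpG ratio) for one window."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    pct_gc = 100.0 * (c + g) / n if n else 0.0
    # Assumed definition: Obs/Exp = (CpG count * N) / (C count * G count)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return pct_gc, obs_exp


def passes_thresholds(window: str, min_pc: float = 50.0, min_oe: float = 0.6):
    """Apply the two default cut-offs quoted in the documentation excerpt."""
    pct_gc, obs_exp = cpg_window_stats(window)
    return pct_gc > min_pc and obs_exp > min_oe


if __name__ == "__main__":
    for example in ("CGCGCGCG", "CCCCGGGG", "ATATATAT"):
        print(example, cpg_window_stats(example), passes_thresholds(example))
```

The second example shows why the ratio matters alongside composition: a window can be entirely C and G yet contain very few CpG dinucleotides.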
Thanks

Nick Staffa
Telephone: 919-316-4569 (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov)
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina

From pmr at ebi.ac.uk Fri Dec 21 04:09:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 21 Dec 2007 09:09:18 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To:
References:
Message-ID: <476B82BE.7000602@ebi.ac.uk>

Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls at ebi.ac.uk)
>
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.

The code is included in EMBOSS (as emboss/newcpgreport.c)

The original reference for the CpG island criteria is in the paper listed
in the "references" section of the newcpgreport documentation.

Larsen F., Gundersen G., Lopez R., Prydz H. "CpG islands as gene markers in
the human genome" Genomics 13:1095-1107 (1992)

If memory serves, this refers to earlier work by Gardiner-Garden.

If you need more information I am just along the corridor from Rodrigo's
office ... once we're both back after Xmas :-)

Hope that helps,

Peter

From rls at ebi.ac.uk Fri Dec 21 04:11:05 2007
From: rls at ebi.ac.uk (Rodrigo Lopez)
Date: Fri, 21 Dec 2007 09:11:05 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To:
References:
Message-ID: <476B8329.9060007@ebi.ac.uk>

Hi,

The relevant papers describing the method in detail are:

PubMed:3656447
Gardiner-Garden M., Frommer M.
CpG islands in vertebrate genomes.
(20-Jul-1987) Journal of molecular biology, 196 (2) :261-82

PubMed:1505946
Larsen F., Gundersen G., Lopez R., Prydz H.
CpG islands as gene markers in the human genome.
(Aug-1992) Genomics, 13 (4) :1095-107

The source code - currently maintained by the EMBOSS team - is in the
EMBOSS distribution. See your /EMBOSS-5.0.0/emboss/newcpgreport.c

Hope this helps. Please do not hesitate to contact me if you have further
queries.

R:)

Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls at ebi.ac.uk)
> European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
> Cambridge CB10 1SD, UK
>
> http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
> says:
> By default, this program defines a CpG island as a region where, over an
> average of 10 windows, the calculated % composition is over 50% and the
> calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
> of 200 bases. These conditions can be modified by setting the values of the
> appropriate parameters.
>
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.
>
> Can anyone help please.
>
> Thanks
>
> Nick Staffa

From gbottu at vub.ac.be Wed Dec 26 09:46:40 2007
From: gbottu at vub.ac.be (Guy Bottu)
Date: Wed, 26 Dec 2007 15:46:40 +0100
Subject: [EMBOSS] extractalign
Message-ID: <47726950.6070007@vub.ac.be>

Dear all,

I just noticed that EMBOSS version 5 contains a program extractalign,
which extracts ranges from a multiple sequence alignment. This is
certainly an interesting tool. The program is however not accompanied by
an on-line manual and it is not mentioned in the ChangeLog.
Any comment from the developers?
Happy Christmas to you all,

Guy Bottu, BEN

From david at compbio.dundee.ac.uk Thu Dec 27 06:31:41 2007
From: david at compbio.dundee.ac.uk (David Martin)
Date: Thu, 27 Dec 2007 11:31:41 +0000
Subject: [EMBOSS] Identifying sequence formats.
Message-ID:

Is there an easy way of identifying the format of a sequence using EMBOSS?
It does wonderful autodetect but I'd like to be able to find out what it
thinks the sequence format is for an arbitrary sequence.

regards

..d

--
David Martin PhD
Post-Genomics and Molecular Interactions Centre
University of Dundee
http://www.compbio.dundee.ac.uk/

From pmr at ebi.ac.uk Fri Dec 28 05:20:27 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 28 Dec 2007 10:20:27 +0000
Subject: [EMBOSS] Identifying sequence formats.
In-Reply-To:
References:
Message-ID: <4774CDEB.4040407@ebi.ac.uk>

David Martin wrote:
> Is there an easy way of identifying the format of a sequence using EMBOSS?
> It does wonderful autodetect but I'd like to be able to find out what it
> thinks the sequence format is for an arbitrary sequence.

The information is stored so you can craft a little application to print
out the value of the FormatStr attribute.

There may be some oddities .... it automatically switches between
EMBL/SwissProt and FASTA/NCBI formats depending on the first line.

Let us know and we can look to apply corrections.

Season's greetings and all the best for the New Year

Peter

From pmr at ebi.ac.uk Fri Dec 28 05:37:07 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 28 Dec 2007 10:37:07 +0000
Subject: [EMBOSS] extractalign
In-Reply-To: <47726950.6070007@vub.ac.be>
References: <47726950.6070007@vub.ac.be>
Message-ID: <4774D1D3.4050503@ebi.ac.uk>

Guy Bottu wrote:
> Dear all,
>
> I just noticed that EMBOSS version 5 contains a program extractalign,
> which extracts ranges from a multiple sequence alignment. This is
> certainly an interesting tool. The program is however not accompanied by
> an on-line manual
> and it is not mentioned in the Changelog.
Any comment fom the developers ? Well ... it is accompanied by an online manual .... just not included in the programs index. edialign and wordfinder were also missing. Now to update the ChangeLog (wordfinder is missing there too)... Season's greetings and Happy New Year Peter From aengus.stewart at cancer.org.uk Wed Dec 5 13:19:05 2007 From: aengus.stewart at cancer.org.uk (Aengus Stewart) Date: Wed, 05 Dec 2007 13:19:05 +0000 Subject: [EMBOSS] restrict -limit Message-ID: <4756A549.1030303@cancer.org.uk> I seem to be having trouble with restrict not picking up -limit or am I not using it correctly? I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI and EcoRII ??? ######################################## # Program: restrict # Rundate: Wed 5 Dec 2007 13:17:08 # Commandline: restrict # -sitelen 4 # -enzymes all # -limit # -blunt # -single # [-sequence] rs9584819.ff # -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict # Report_format: table # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict ######################################## #======================================= # # Sequence: rs9584819 from: 1 to: 101 # HitCount: 24 # # Minimum cuts per enzyme: 1 # Maximum cuts per enzyme: 1 # Minimum length of recognition site: 4 # Blunt ends allowed # Sticky ends allowed # DNA is linear # Ambiguities allowed # #======================================= Start End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev 13 17 BssKI CCNGG 12 17 . . 13 17 BseBI CCWGG 14 15 . . 13 17 ScrFI CCNGG 14 15 . . 13 17 EcoRII CCWGG 12 17 . . 
Regards Aengus -- ----------------------------------------------------------------------- Aengus Stewart Head of Bioinformatics and BioStatistics Bioinformatics and BioStatistics Tel: +44 (0)20 7269 3679 Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK ----------------------------------------------------------------------- This electronic message contains information which may be privileged and confidential. The information is intended to be for the use of the individual(s) or entity named above. Be aware that any third party disclosure, distribution, copying or use of this communication, without prior permission, is strictly prohibited. From aengus.stewart at cancer.org.uk Wed Dec 5 15:00:50 2007 From: aengus.stewart at cancer.org.uk (Aengus Stewart) Date: Wed, 05 Dec 2007 15:00:50 +0000 Subject: [EMBOSS] restrict -limit In-Reply-To: <4756A549.1030303@cancer.org.uk> References: <4756A549.1030303@cancer.org.uk> Message-ID: <4756BD22.6070902@cancer.org.uk> Yeah I know, not one of my brightest days............... Helps to look at cut position as well as motif *sigh* Aengus Aengus Stewart wrote: > I seem to be having trouble with restrict not picking up -limit or am I not using it correctly? > > I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI and EcoRII ??? 
> > ######################################## > # Program: restrict > # Rundate: Wed 5 Dec 2007 13:17:08 > # Commandline: restrict > # -sitelen 4 > # -enzymes all > # -limit > # -blunt > # -single > # [-sequence] rs9584819.ff > # -outfile /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict > # Report_format: table > # Report_file: /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict > ######################################## > > #======================================= > # > # Sequence: rs9584819 from: 1 to: 101 > # HitCount: 24 > # > # Minimum cuts per enzyme: 1 > # Maximum cuts per enzyme: 1 > # Minimum length of recognition site: 4 > # Blunt ends allowed > # Sticky ends allowed > # DNA is linear > # Ambiguities allowed > # > #======================================= > > Start End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev > 13 17 BssKI CCNGG 12 17 . . > 13 17 BseBI CCWGG 14 15 . . > 13 17 ScrFI CCNGG 14 15 . . > 13 17 EcoRII CCWGG 12 17 . . > > > > > Regards > Aengus > > > -- ----------------------------------------------------------------------- Aengus Stewart Head of Bioinformatics and BioStatistics Bioinformatics and BioStatistics Tel: +44 (0)20 7269 3679 Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK ----------------------------------------------------------------------- This electronic message contains information which may be privileged and confidential. The information is intended to be for the use of the individual(s) or entity named above. Be aware that any third party disclosure, distribution, copying or use of this communication, without prior permission, is strictly prohibited. 
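[Archive note] The point about looking at the cut position as well as the motif can be checked mechanically. The sketch below is not part of restrict. It hard-codes the four rows of the report quoted above and groups them first by recognition site alone, then by site plus cut positions. The tuple layout and the helper name `group_by` are assumptions made for illustration only.

```python
from collections import defaultdict

# The four hits at 13..17 from the restrict report quoted above:
# (enzyme name, recognition site, 5prime cut, 3prime cut)
HITS = [
    ("BssKI", "CCNGG", 12, 17),
    ("BseBI", "CCWGG", 14, 15),
    ("ScrFI", "CCNGG", 14, 15),
    ("EcoRII", "CCWGG", 12, 17),
]


def group_by(hits, key):
    """Group enzyme names by the value of key(hit) (hypothetical helper)."""
    groups = defaultdict(list)
    for hit in hits:
        groups[key(hit)].append(hit[0])
    return dict(groups)


# Grouping by motif only makes the enzymes look like duplicates ...
by_site = group_by(HITS, key=lambda h: h[1])
# ... but adding the cut positions separates all four of them.
by_site_and_cut = group_by(HITS, key=lambda h: (h[1], h[2], h[3]))

print(by_site)
print(by_site_and_cut)
```

Grouped by site alone, BssKI pairs with ScrFI and BseBI with EcoRII. Once the cut positions are part of the key, every enzyme ends up in a group of its own.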
From ajb at ebi.ac.uk Wed Dec 5 15:08:26 2007 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Wed, 5 Dec 2007 15:08:26 -0000 (GMT) Subject: [EMBOSS] restrict -limit In-Reply-To: <4756A549.1030303@cancer.org.uk> References: <4756A549.1030303@cancer.org.uk> Message-ID: <56936.81.98.241.17.1196867306.squirrel@webmail.ebi.ac.uk> Hello Aengus, Restrict will report enzymes with the same recognition site if the source REBASE database lists them as having different cut sites. That appears to be the case with your reported output. So, you do seem to be using it correctly and the results also seem to be correct. Alan > > I seem to be having trouble with restrict not picking up -limit or am I > not using it correctly? > > I shouldnt be getting both BssKI and ScrFI should I or indeed both BseBI > and EcoRII ??? > > ######################################## > # Program: restrict > # Rundate: Wed 5 Dec 2007 13:17:08 > # Commandline: restrict > # -sitelen 4 > # -enzymes all > # -limit > # -blunt > # -single > # [-sequence] rs9584819.ff > # -outfile > /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict > # Report_format: table > # Report_file: > /home2/bioadmin/projects/MANOJ/RESTRICT_RESULTS/rs9584819.ff.restrict > ######################################## > > #======================================= > # > # Sequence: rs9584819 from: 1 to: 101 > # HitCount: 24 > # > # Minimum cuts per enzyme: 1 > # Maximum cuts per enzyme: 1 > # Minimum length of recognition site: 4 > # Blunt ends allowed > # Sticky ends allowed > # DNA is linear > # Ambiguities allowed > # > #======================================= > > Start End Enzyme_name Restriction_site 5prime 3prime 5primerev > 3primerev > 13 17 BssKI CCNGG 12 17 . > . > 13 17 BseBI CCWGG 14 15 . > . > 13 17 ScrFI CCNGG 14 15 . > . > 13 17 EcoRII CCWGG 12 17 . > . 
> > > > > Regards > Aengus > > > > -- > ----------------------------------------------------------------------- > Aengus Stewart > Head of Bioinformatics and BioStatistics > Bioinformatics and BioStatistics Tel: +44 (0)20 7269 3679 > Cancer Research UK, Lincoln's Inn Fields, Holborn, London, WC2A 3PX, UK > ----------------------------------------------------------------------- > > This electronic message contains information which may be privileged and > confidential. The information is intended to be for the use of the > individual(s) or entity named above. Be aware that any third party > disclosure, distribution, copying or use of this communication, without > prior permission, is strictly prohibited. > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From gbottu at vub.ac.be Wed Dec 5 15:30:25 2007 From: gbottu at vub.ac.be (Guy Bottu) Date: Wed, 05 Dec 2007 16:30:25 +0100 Subject: [EMBOSS] restrict -limit In-Reply-To: <4756A549.1030303@cancer.org.uk> References: <4756A549.1030303@cancer.org.uk> Message-ID: <4756C411.8090101@vub.ac.be> Aengus Stewart wrote: > I seem to be having trouble with restrict not picking up -limit or am I not using it correctly? restrict by default searches only for prototype enzymes ; if you want to see all enzymes you must explicitly set -nolimit. I however notice that also at our site the file .../share/EMBOSS/data/embossre.equ does not contain entries for BssKI and BseBI, while it should. Maybe there is a bug in the program rebaseextract or some subtle typo in the files from the Rebase. Could the EMBOSS team figure it out ? 
Guy Bottu, Belgian EMBnet Node

From pmr at ebi.ac.uk Wed Dec 5 15:57:59 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 05 Dec 2007 15:57:59 +0000
Subject: [EMBOSS] restrict -limit
In-Reply-To: <4756C411.8090101@vub.ac.be>
References: <4756A549.1030303@cancer.org.uk> <4756C411.8090101@vub.ac.be>
Message-ID: <4756CA87.2080206@ebi.ac.uk>

Guy Bottu wrote:
> I however notice that also at our site
> the file .../share/EMBOSS/data/embossre.equ does not contain entries for BssKI
> and BseBI, while it should. Maybe there is a bug in the program rebaseextract or
> some subtle typo in the files from the Rebase. Could the EMBOSS team figure it out ?

Which version of REBASE did you use for rebaseextract?

Peter

From sum732 at mail.usask.ca Fri Dec 7 23:01:43 2007
From: sum732 at mail.usask.ca (Sudeep Mehrotra)
Date: Fri, 07 Dec 2007 17:01:43 -0600
Subject: [EMBOSS] Emboss-Digest
Message-ID: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>

Hello,
I used "digest" from EMBOSS to digest a protein database obtained from
NCBI REFSEQ.
Here is how I executed digest:

digest -seqall "DB_NAME" -aadata "File_name" -outfile "File_Name"

From the list I selected trypsin.
For some reason, digest skipped this particular protein (no fragments
were generated):

>gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis stolonifera]
MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR

Any ideas?

I should get two fragments. I don't want to see the partial digests, so
that is why I never selected the option.

Thanks
Sudeep

From ajb at ebi.ac.uk Sat Dec 8 01:13:31 2007
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Sat, 8 Dec 2007 01:13:31 -0000 (GMT)
Subject: [EMBOSS] Emboss-Digest
In-Reply-To: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
References: <7F75181D-3B12-4A0D-99DC-590DD253502F@mail.usask.ca>
Message-ID: <34776.81.98.241.17.1197076411.squirrel@webmail.ebi.ac.uk>

Dear Sudeep,

Trypsin doesn't cut as well if (e.g.) the K is followed by any of "KRIFLP" (Prof. D.
Pappin, personal comm). Your sequence contains "...KL..." so there is no cut. If you want unfavoured cuts to be shown (e.g. a cut after every K for trypsin) then add the flag "-unfavoured" to the command line. HTH Alan > Hello, > I used "digest" from EMBOSS to digest protein database obtained from > NCBI REFSEQ. > Here is how I executed digest: > digest -seqall "DB_NAME" -aadata "File_name"- outfile "File_Name" > From the list I selected trypsin > For some reason, digest skipped (no fragments were generated) for this > particular protein > >gi|118430285|ref|YP_874719.1| photosystem II protein K [Agrostis > stolonifera] > MPNILSLTCICFNSVLYPTTSFFFAKLPEAYAIFNPIVDVMPVIPLFFFLLAFVWQAAVSFR > > any ideas? > > I should get two fragments. I don't want to see the partial digests so > that is why I never selected the option. > > Thanks > Sudeep > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From mike.thon at gmail.com Wed Dec 12 10:24:34 2007 From: mike.thon at gmail.com (Michael Thon) Date: Wed, 12 Dec 2007 11:24:34 +0100 Subject: [EMBOSS] EMBOSS database queries Message-ID: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> I am setting up a database from Genbank formatted files. I understand how to index the db and configure the emboss.default file but I don't know how to construct the queries. queries for sequence IDs are pretty simple, i.e. with a USA of the format "dbname:id". But, how to I create a query for the other fields, such as org and key? Also, do these fields support wildcards or substring matches or other fancy stuff? 
cheers Mike From pmr at ebi.ac.uk Wed Dec 12 11:21:51 2007 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 12 Dec 2007 11:21:51 +0000 Subject: [EMBOSS] EMBOSS database queries In-Reply-To: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> Message-ID: <475FC44F.7030700@ebi.ac.uk> Michael Thon wrote: > I am setting up a database from Genbank formatted files. I understand > how to index the db and configure the emboss.default file but I don't > know how to construct the queries. queries for sequence IDs are pretty > simple, i.e. with a USA of the format "dbname:id". But, how to I create > a query for the other fields, such as org and key? Also, do these > fields support wildcards or substring matches or other fancy stuff? Assuming you indexed all the fields (by default ID and ACC are indexed) you use the same syntax as in srs (we saw no need to invent a new syntax, so we used the same field name abbreviations but we did drop the '[]' around the query :-) dbname-acc:x13776 dbname-org:pseudomonas* dbname-des:amidase dbname-key: dbname-sv: dbname-gi: and, to complete the set, dbname-id:x13776 As you see, wildcards are allowed with '*' at the end. We can make this much more sophisticated, allowing more wildcard options and combining queries. So far EMBOSS users have been content to use SRS or alternatives (MRS for example). If there is interest, we can extend the USA to include wildcards, AND/OR/NOT, search multiple fields, combine databases, and if we get really ambitious we could include links between databases. We will have to be careful to restrict some of these extensions to database access methods that support them. 
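[Archive note] The query syntax described above can be sketched as a small helper that assembles a USA string from a database name, a field abbreviation and a query term. This is only an illustration: `build_usa` and `FIELDS` are hypothetical names, not part of EMBOSS, and the wildcard check simply mirrors the statement that '*' is allowed at the end of the query.

```python
# Field abbreviations listed in the message above.
FIELDS = {"id", "acc", "org", "des", "key", "sv", "gi"}


def build_usa(dbname, field, query):
    """Return a USA of the form 'dbname-field:query' (hypothetical helper)."""
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field!r}")
    if not query:
        raise ValueError("empty query")
    # Only a trailing '*' is described in the message, so reject
    # a wildcard anywhere else in the query term.
    if "*" in query[:-1]:
        raise ValueError("'*' is only accepted at the end of the query")
    return f"{dbname}-{field}:{query}"


if __name__ == "__main__":
    print(build_usa("dbname", "acc", "x13776"))
    print(build_usa("dbname", "org", "pseudomonas*"))
```

The resulting string can be given wherever an EMBOSS program expects a sequence, for example as the sequence argument of seqret. On a Unix shell the '*' should be quoted so that the shell does not expand it.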
Hope this helps, Peter From mike.thon at gmail.com Wed Dec 12 16:12:05 2007 From: mike.thon at gmail.com (Michael Thon) Date: Wed, 12 Dec 2007 17:12:05 +0100 Subject: [EMBOSS] EMBOSS database queries In-Reply-To: <475FC44F.7030700@ebi.ac.uk> References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> <475FC44F.7030700@ebi.ac.uk> Message-ID: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com> Thanks Peter, I got it working. While I'm at it, a couple more questions popped up: 1) do you know if these indexes compatible with the Bio::DB::Registry type databases? 2) Is there any way to index and search sequence features? Best Mike On Dec 12, 2007, at 12:21 PM, Peter Rice wrote: > Michael Thon wrote: >> I am setting up a database from Genbank formatted files. I >> understand how to index the db and configure the emboss.default >> file but I don't know how to construct the queries. queries for >> sequence IDs are pretty simple, i.e. with a USA of the format >> "dbname:id". But, how to I create a query for the other fields, >> such as org and key? Also, do these fields support wildcards or >> substring matches or other fancy stuff? > > Assuming you indexed all the fields (by default ID and ACC are > indexed) > you use the same syntax as in srs (we saw no need to invent a new > syntax, so we used the same field name abbreviations but we did drop > the > '[]' around the query :-) > > dbname-acc:x13776 > dbname-org:pseudomonas* > dbname-des:amidase > dbname-key: > dbname-sv: > dbname-gi: > > and, to complete the set, dbname-id:x13776 > > As you see, wildcards are allowed with '*' at the end. > > We can make this much more sophisticated, allowing more wildcard > options > and combining queries. So far EMBOSS users have been content to use > SRS > or alternatives (MRS for example). > > If there is interest, we can extend the USA to include wildcards, > AND/OR/NOT, search multiple fields, combine databases, and if we get > really ambitious we could include links between databases. 
> > We will have to be careful to restrict some of these extensions to > database access methods that support them. > > Hope this helps, > > Peter From pmr at ebi.ac.uk Wed Dec 12 16:20:31 2007 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 12 Dec 2007 16:20:31 +0000 Subject: [EMBOSS] EMBOSS database queries In-Reply-To: <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com> References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> <475FC44F.7030700@ebi.ac.uk> <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com> Message-ID: <47600A4F.8030107@ebi.ac.uk> Michael Thon wrote: > Thanks Peter, I got it working. > While I'm at it, a couple more questions popped up: > 1) do you know if these indexes compatible with the Bio::DB::Registry > type databases? No ... well, we could add Bio::DB indices to the things EMBOSS can retrieve, then they would be :-) > 2) Is there any way to index and search sequence features? Not at present - but: 2a. what would you like to search for ... 2b. what would you like as the result ... 2b.i. if you want features, what do we call them? regards, Peter From mike.thon at gmail.com Fri Dec 14 17:28:59 2007 From: mike.thon at gmail.com (Michael Thon) Date: Fri, 14 Dec 2007 18:28:59 +0100 Subject: [EMBOSS] EMBOSS database queries In-Reply-To: <47600A4F.8030107@ebi.ac.uk> References: <8E65FD27-B3F0-4CBD-AC56-1A61E02A0871@gmail.com> <475FC44F.7030700@ebi.ac.uk> <4A6C16F2-F191-4F38-8FB5-6933F782077C@gmail.com> <47600A4F.8030107@ebi.ac.uk> Message-ID: <2F72B57C-A8C4-4A8E-84B6-5793764DBDD4@gmail.com> On Dec 12, 2007, at 5:20 PM, Peter Rice wrote: > Michael Thon wrote: >> Thanks Peter, I got it working. >> While I'm at it, a couple more questions popped up: >> 1) do you know if these indexes compatible with the >> Bio::DB::Registry type databases? > > No ... well, we could add Bio::DB indices to the things EMBOSS can > retrieve, then they would be :-) > >> 2) Is there any way to index and search sequence features? > > Not at present - but: > > 2a. 
what would you like to search for ...
> 2b. what would you like as the result ...
> 2b.i. if you want features, what do we call them?
>

Actually, I haven't given it much thought. But, for starters, one
might want to retrieve proteins containing domain X, or that are
annotated with InterPro term Y. Perhaps some of this functionality
could be accomplished through clever use of the key or des fields, i.e.
by putting all the InterPro terms assigned to a protein in the
keyword field prior to indexing.

One might also want to query a database of genomic DNA and fetch a
translation of a gene or its spliced CDS.

best
Mike

From bernd.web at gmail.com Mon Dec 17 20:01:32 2007
From: bernd.web at gmail.com (Bernd Web)
Date: Mon, 17 Dec 2007 21:01:32 +0100
Subject: [EMBOSS] iep/gifasta
Message-ID: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>

Hi,

I'd like to run iep on a sequence and use either pir or osformat gifasta.
The following gives an error (using emboss 5.0.0 on Debian):

iep -filter -osformat gifasta -sequence seq.txt
This returns "Died: Unknown qualifier -osformat"

iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
also give an error:
"Died: iep terminated: Bad value for '-sequence' with -auto defined"
(with or without the sequence flag)

However, iep -sformat fasta seq.txt works. What am I doing wrong?

I'd like the output to contain the accession number. I thought -osformat
gifasta was for this purpose.
My FastA definition line is e.g.
>ENSG00000205090|1|protein_coding.
The IEP report would be more useful if it contained the ENSG number
instead of "protein_coding" or the entire definition line. How can I do this?
Kind regards,
Bernd

From pmr at ebi.ac.uk Tue Dec 18 09:23:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 18 Dec 2007 09:23:18 +0000
Subject: [EMBOSS] iep/gifasta
In-Reply-To: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
References: <716af09c0712171201j2a24c5c7g46a43877cbd326c0@mail.gmail.com>
Message-ID: <47679186.6020003@ebi.ac.uk>

Hi Bernd,

Bernd Web wrote:
> Hi,
>
> I'd like to run iep on a sequence and use either pir or osformat gifasta.
> The following gives an error (using emboss 5.0.0 on Debian):
>
> iep -filter -osformat gifasta -sequence seq.txt
> This returns "Died: Unknown qualifier -osformat"

-osformat is for sequence outputs (and iep has no sequence outputs).

iep writes a plain text file as output and has no special options, but we
will add more information (accession and description) for a future
release ... and to other plain text output files too.

> iep -filter -sformat pir seq.txt or iep -sformat pir -sequence seq.txt
> also give an error:
> "Died: iep terminated: Bad value for '-sequence' with -auto defined"
> (with or without the sequence flag)
>
> However, iep -sformat fasta seq.txt works. What am I doing wrong?

It appears your sequence can be read in fasta format but not in pir
format. PIR format has special characters after the first '>'.

> My FastA definition line is e.g.
> >ENSG00000205090|1|protein_coding.
> The IEP report would be more useful if it contained the ENSG number
> instead of "protein_coding" or the entire definition line.

Not a nice format. NCBI made up a lot of FASTA file identifiers with '|'
characters and we try to follow their rules. That causes us to ignore
the first part (it should be a database name) and read the ID from the end.

You could reformat the FASTA files (e.g. with a perl script) to remove
the '|' characters and leave something useful as the plain ID (perhaps
ENSG00000205090_1 in this case) and the rest as description.
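[Archive note] The reformatting suggested above could be sketched as follows, here in Python rather than Perl. The sketch assumes headers of the form `>ID|number|rest`, as in the example definition line, and joins the first two fields with '_' to give the plain ID suggested in the message. The function names and the rule for what counts as description are assumptions, not an EMBOSS convention.

```python
import sys


def reformat_header(line):
    """Rewrite '>ENSG00000205090|1|protein_coding' as
    '>ENSG00000205090_1 protein_coding'."""
    fields = line[1:].strip().split("|")
    if len(fields) < 3:
        # Not in the assumed 'ID|number|rest' layout: leave untouched.
        return line.rstrip("\n")
    seq_id = "_".join(fields[:2])        # plain ID without '|' characters
    description = " ".join(fields[2:])   # everything else becomes description
    return f">{seq_id} {description}".rstrip()


def reformat_fasta(lines):
    """Yield the FASTA lines with every header line rewritten."""
    for line in lines:
        if line.startswith(">"):
            yield reformat_header(line)
        else:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Read FASTA on standard input, write the reformatted file to standard output.
    for out in reformat_fasta(sys.stdin):
        print(out)
```

Saved under a name of your choice (say `reformat_fasta.py`, a hypothetical file name), it would be run as `python reformat_fasta.py < seq.txt > seq_fixed.txt`, and the rewritten file can then be passed to iep.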
Hope that helps,

Peter Rice

From peter.robinson at t-online.de Thu Dec 20 15:08:59 2007
From: peter.robinson at t-online.de (Peter Robinson)
Date: Thu, 20 Dec 2007 16:08:59 +0100
Subject: [EMBOSS] Seqall Datatype
Message-ID: <476A858B.9080403@t-online.de>

Dear EMBOSSERs,

I am trying my hand at an EMBOSS program and would like to read in a
list of sequences from a FASTA file and make pairwise comparisons
between each sequence. If I start with an AjPSeqall object

    AjPSeqall seqs=NULL;
    seqs = ajAcdGetSeqall ("seqs");

I have seen

    AjPSeq seq;
    while(ajSeqallNext(seqs, &seq)) { }

in the documentation, but I would like to do something like a double for
loop to get all pairwise comparisons. What is the best way of doing
this? I have been searching in the online docs but did not yet find
anything.

By the way, in http://emboss.sourceforge.net/developers/program.html

*17.2 Getting information from a sequence*
*ajSeqGetName* get the name. This is a pointer to the internal AjPStr
*ajSeqName* get the name. This is a pointer to the internal char*

these datatypes are flagged as obsolete by the compiler, so the document
may need revision here?

Thanks,
Peter Robinson

From pmr at ebi.ac.uk Thu Dec 20 16:07:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 20 Dec 2007 16:07:18 +0000
Subject: [EMBOSS] Seqall Datatype
In-Reply-To: <476A858B.9080403@t-online.de>
References: <476A858B.9080403@t-online.de>
Message-ID: <476A9336.9090504@ebi.ac.uk>

Dear Peter,

> I am trying my hand at an EMBOSS program and would like to read in a
> list of sequences from a FASTA file and make pairwise comparisons
> between each sequence. If I start with an AjPSeqall object
>
> AjPSeqall seqs=NULL;
> seqs = ajAcdGetSeqall ("seqs");

You want all the sequences in memory so you can work through the pairs -
better to use AjPSeqset and ajAcdGetSeqset

> in the documentation, but I would like to do something like a double for
> loop to get all pairwise comparisons. What is the best way of doing
> this?
I have been searching in the online docs but did not yet find > anything. distmat has the kind of loop you are looking for (except it does self matches too) > By the way, in http://emboss.sourceforge.net/developers/program.html > > these datatypes are flagged as obsolete by the compiler, so the document > may need revision here? Yes, all being revised for the books we are preparing ... we will take a look through program.html and make some basic updates to correct these things. regards, Peter From peter.robinson at t-online.de Thu Dec 20 17:00:38 2007 From: peter.robinson at t-online.de (Peter Robinson) Date: Thu, 20 Dec 2007 18:00:38 +0100 Subject: [EMBOSS] Seqall Datatype In-Reply-To: <476A9336.9090504@ebi.ac.uk> References: <476A858B.9080403@t-online.de> <476A9336.9090504@ebi.ac.uk> Message-ID: <476A9FB6.6080407@t-online.de> Peter Rice wrote: > Dear Peter, > >> I am trying my hand at an EMBOSS program and would like to read in a >> list of sequences from a FASTA file and make pairwise comparisons >> between each sequence. If I startwith a AjPSeqall object >> >> AjPSeqall seqs=NULL; >> seqs = ajAcdGetSeqall ("seqs"); > > You want all the sequences in memory so you can work through the pairs > - better to use AjPSeqset and ajAcdGetSeqset > >> in the documentation, but I would like to do something like a double >> for loop to get all pairwise comparisons. What is the best way of >> doing this? I have been searching in the online docs but did not yet >> find anything. > > distmat has the kind of loop you are looking for (except it does self > matches too) > > >> By the way, in http://emboss.sourceforge.net/developers/program.html >> >> these datatypes are flagged as obsolete by the compiler, so the >> document may need revision here? > > Yes, all being revised for the books we are preparing ... we will take > a look through program.html and make some basic updates to correct > these things. 
>
> regards,
>
> Peter
>
Dear Peter,

thanks for the tip, that was just what I needed!
Best wishes for the holidays!

Peter

From staffa at niehs.nih.gov Thu Dec 20 21:44:02 2007
From: staffa at niehs.nih.gov (Staffa, Nick (NIH/NIEHS))
Date: Thu, 20 Dec 2007 16:44:02 -0500
Subject: [EMBOSS] newcpgreport
Message-ID:

I have been using EMBOSS newcpgreport by
Rodrigo Lopez (rls at ebi.ac.uk)
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK

http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html
says:
By default, this program defines a CpG island as a region where, over an
average of 10 windows, the calculated % composition is over 50% and the
calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum
of 200 bases. These conditions can be modified by setting the values of the
appropriate parameters.

I may be very dull and unimaginative, but I'd sure like a more detailed
explanation of what the program is doing to define a CpG island.
Does anyone know where this might be found?
Or even the code.

Can anyone help please.

Thanks

Nick Staffa
Telephone: 919-316-4569 (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov)
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina

From pmr at ebi.ac.uk Fri Dec 21 09:09:18 2007
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 21 Dec 2007 09:09:18 +0000
Subject: [EMBOSS] newcpgreport
In-Reply-To:
References:
Message-ID: <476B82BE.7000602@ebi.ac.uk>

Staffa, Nick (NIH/NIEHS) wrote:
> I have been using EMBOSS newcpgreport by
> Rodrigo Lopez (rls at ebi.ac.uk)
>
> I may be very dull and unimaginative, but I'd sure like a more detailed
> explanation of what the program is doing to define a CpG island.
> Does anyone know where this might be found?
> Or even the code.
The code is included in EMBOSS (as emboss/newcpgreport.c) The original reference for the CpG island criteria is in the paper listed in the "references" section of the newcpgreport documentation. Larsen F., Gundersen, G., Lopez L., Prydz H. "CpG island as Gene Markers in the Human Genome" Genomics 13:1095-1107 (1992) If memory serves, this refers to earlier work by Gardiner-Garden. If you need more information I am just along the corridor from Rodrigo's office ... once we're both back after Xmas :-) Hope that helps, Peter From rls at ebi.ac.uk Fri Dec 21 09:11:05 2007 From: rls at ebi.ac.uk (Rodrigo Lopez) Date: Fri, 21 Dec 2007 09:11:05 +0000 Subject: [EMBOSS] newcpgreport In-Reply-To: References: Message-ID: <476B8329.9060007@ebi.ac.uk> Hi, The relevant papers describing the method in detail are: PubMed:3656447 Gardiner-Garden M., Frommer M. CpG islands in vertebrate genomes. (20-Jul-1987) Journal of molecular biology, 196 (2) :261-82 PubMed:1505946 Larsen F., Gundersen G., Lopez R., Prydz H. CpG islands as gene markers in the human genome. (Aug-1992) Genomics, 13 (4) :1095-107 The source code - currently maintained by the EMBOSS team - is in the EMBOSS distribution. See your /EMBOSS-5.0.0/emboss/newcpgreport.c Hope this helps. Please do not hesitate to contact me if you have further queries. R:) Staffa, Nick (NIH/NIEHS) wrote: > I have been using EMBOSS newcpgreport by > Rodrigo Lopez (rls ? ebi.ac.uk) > European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, > Cambridge CB10 1SD, UK > > http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/newcpgreport.html > says: > By default, this program defines a CpG island as a region where, over an > average of 10 windows, the calculated % composition is over 50% and the > calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum > of 200 bases. These conditions can be modified by setting the values of the > appropriate parameters. 
> > I may be very dull and unimaginative, but I'd sure like a more detailed > explanation of what the program is doing to define a CpG island. > Does anyone know where this might be found? > Or even the code. > > Can anyone help please. > > Thanks > > Nick Staffa > Telephone: 919-316-4569 (NIEHS: 6-4569) > Scientific Computing Support Group > NIEHS Information Technology Support Services Contract > (Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov) > National Institute of Environmental Health Sciences > National Institutes of Health > Research Triangle Park, North Carolina > > From gbottu at vub.ac.be Wed Dec 26 14:46:40 2007 From: gbottu at vub.ac.be (Guy Bottu) Date: Wed, 26 Dec 2007 15:46:40 +0100 Subject: [EMBOSS] extractalign Message-ID: <47726950.6070007@vub.ac.be> Dear all, I just noticed that EMBOSS version 5 contains a program extractalign, which extracts ranges from a multiple sequence alignment. This is certainly an interesting tool. The program is however not accompanied by an on-line manual and it is not mentioned in the Changelog. Any comment fom the developers ? Happy Christmas to you all, Guy Bottu, BEN From david at compbio.dundee.ac.uk Thu Dec 27 11:31:41 2007 From: david at compbio.dundee.ac.uk (David Martin) Date: Thu, 27 Dec 2007 11:31:41 +0000 Subject: [EMBOSS] Identifying sequence formats. Message-ID: Is there an easy way of identifying the format of a sequence using EMBOSS? It does wonderful autodetect but I'd like to be able to find out what it thinks the sequence format is for an arbitrary sequence. regards ..d -- David Martin PhD Post-Genomics and Molecular Interactions Centre University of Dundee http://www.compbio.dundee.ac.uk/ From pmr at ebi.ac.uk Fri Dec 28 10:20:27 2007 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 28 Dec 2007 10:20:27 +0000 Subject: [EMBOSS] Identifying sequence formats. 
In-Reply-To: References: Message-ID: <4774CDEB.4040407@ebi.ac.uk> David Martin wrote: > Is there an easy way of identifying the format of a sequence using EMBOSS? > It does wonderful autodetect but I'd like to be able to find out what it > thinks the sequence format is for an arbitrary sequence. The information is stored so you can craft a little application to print out the value of the FormatStr attribute. There may be some oddities .... it automatically switches between EMBL/SwissProt and FASTA/NCBI formats depending on the first line. Let us know and we can look to apply corrections. Season's greetings and all the best for the New Year Peter From pmr at ebi.ac.uk Fri Dec 28 10:37:07 2007 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 28 Dec 2007 10:37:07 +0000 Subject: [EMBOSS] extractalign In-Reply-To: <47726950.6070007@vub.ac.be> References: <47726950.6070007@vub.ac.be> Message-ID: <4774D1D3.4050503@ebi.ac.uk> Guy Bottu wrote: > Dear all, > > I just noticed that EMBOSS version 5 contains a program extractalign, which > extracts ranges from a multiple sequence alignment. This is certainly an > interesting tool. The program is however not accompanied by an on-line > manual > and it is not mentioned in the Changelog. Any comment fom the developers ? Well ... it is accompanied by an online manual .... just not included in the programs index. edialign and wordfinder were also missing. Now to update the ChangeLog (wordfinder is missing there too)... Season's greetings and Happy New Year Peter