From hpm at bioinfo-user.org.uk  Sun Dec  3 10:06:25 2006
From: hpm at bioinfo-user.org.uk (Hamish McWilliam)
Date: Sun, 03 Dec 2006 15:06:25 +0000
Subject: [EMBOSS] EMBOSS database setup
In-Reply-To: <43964.81.98.244.247.1164904060.squirrel@webmail.ebi.ac.uk>
References: <639b80db0611281217i6c1cc927v50ac9b8e6a71717c@mail.gmail.com>	<1164902235.14146.57.camel@emboss2.ebi.ac.uk>
	<43964.81.98.244.247.1164904060.squirrel@webmail.ebi.ac.uk>
Message-ID: <4572E7F1.5080709@bioinfo-user.org.uk>

Hi Alan,

> The appended definitions are simple ones that may be
> useful if you only want a few sequences at a time.
> If sites upgrade to SRS8 then alter accordingly.
> 
> Alan
> 
> DB embl [  type: N method: srswww format: embl release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "EMBL from the EBI" ]
> 
> DB em [  type: N method: srswww format: embl release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   dbalias: "EMBL"
>   comment: "EMBL from the EBI" ]
> 
> DB swissprot [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "SWISSPROT from the EBI" ]
> 
> DB sw [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   dbalias: "SWISSPROT"
>   comment: "SWISSPROT from the EBI" ]
> 
> DB uniprot [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "UNIPROT from the EBI" ]
> 
> DB uni [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   dbalias: "UNIPROT"
>   comment: "UNIPROT from the EBI" ]
> 
> DB pir [  type: P method: srswww format: nbrf release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "PIR from the EBI" ]
> 
> DB genbank [  type: N method: srswww format: genbank release: "NCBI"
>   url: "http://www.infobiogen.fr/srs7bin/cgi-bin/wgetz"
>   comment: "GenBank from Infobiogen" ]
> 
> DB gb [  type: N method: srswww format: genbank release: "NCBI"
>   url: "http://www.infobiogen.fr/srs7bin/cgi-bin/wgetz"
>   dbalias: "GENBANK"
>   comment: "GenBank from Infobiogen" ]
> 
> DB refseq [  type: N method: srswww format: genbank release: "NCBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "REFSEQ from EBI" ]

For the EBI's SRS server please use:

   http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz

as the URL. This should allow for continued support when the server is 
upgraded.
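
A minimal sketch of that URL change, assuming the definitions are held as plain text in the style quoted above. The helper name and regular expression are hypothetical and not part of EMBOSS; only the EBI host is rewritten, so entries for other sites are left alone.

```python
import re

# Hypothetical helper, not part of EMBOSS: point EBI SRS entries in
# emboss.default-style text at the version-independent "srsbin" path.
EBI_SRS7 = re.compile(r"(http://srs\.ebi\.ac\.uk/)srs7bin(/cgi-bin/wgetz)")


def use_generic_ebi_url(config_text):
    """Replace 'srs7bin' with 'srsbin' in EBI SRS URLs only."""
    return EBI_SRS7.sub(r"\1srsbin\2", config_text)


# One definition copied from the quoted message above.
example = '''DB embl [  type: N method: srswww format: embl release: "EBI"
  url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
  comment: "EMBL from the EBI" ]
'''

updated = use_generic_ebi_url(example)
print(updated)
```

Restricting the pattern to srs.ebi.ac.uk is a deliberate choice: other SRS sites may still need a version-specific path.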

Also note that the Infobiogen SRS service is no longer available. For 
other SRS sites carrying GenBank please see the Public SRS Server List 
(http://downloads.biowisdomsrs.com/publicsrs.html).

Hamish


From maoj at helix.nih.gov  Mon Dec  4 10:49:26 2006
From: maoj at helix.nih.gov (Jean Mao)
Date: Mon, 4 Dec 2006 10:49:26 -0500
Subject: [EMBOSS] Application for PFAM?
Message-ID: <000501c717bb$c6d353d0$be4de780@CIT.NIH.GOV>

Hi,
Just wondering if EMBOSS has any program that will search a pfam database? 


From David.Bauer at SCHERING.DE  Tue Dec  5 01:40:54 2006
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 5 Dec 2006 07:40:54 +0100
Subject: [EMBOSS] Re: Application for PFAM?
In-Reply-To: <000501c717bb$c6d353d0$be4de780@CIT.NIH.GOV>
Message-ID: <OF66C1DF6D.0A27E6D3-ONC125723B.00244163-C125723B.0024B45A@schering.de>

Hi Jean,

Not in core EMBOSS, but there is the HMMER-2.3.2 EMBASSY package.
It also contains the program ehmmpfam, which searches a sequence against the
Pfam HMM database.

HTH,
David.

emboss-bounces at lists.open-bio.org wrote on 04/12/2006 16:49:26:

> Hi,
> Just wondering if EMBOSS has any program that will search a pfam database?
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss


From JK at novozymes.com  Wed Dec  6 07:34:38 2006
From: JK at novozymes.com (JK (Jesper Agerbo Krogh))
Date: Wed, 6 Dec 2006 13:34:38 +0100
Subject: [EMBOSS] Output from seqret in fastaformat.
Message-ID: <934F95E71B6C9347A873C42AE3C196190B84C672@NZT0004E.dknz.nzcorp.net>


Hi.. 

I've got dbxflat to index the swissprot database, but I'd like to have the output
formatted with the USA as the fasta ID.

Current..:

seqret UNIPROT:Q12345
Reads and writes (returns) sequences
output sequence(s) [ies3_yeast.fasta]:

>IES3_YEAST Q12345 Ino eighty subunit 3.
MKFEDLLATNKQVQFAHAATQHYKSVKTPDFLEKDPHHKKFHNADGLNQQGSSTPSTATD
ANAASTASTHTNTTTFKRHIVAVDDISKMNYEMIKNSPGNVITNANQDEIDISTLKTRLY
KDNLYAMNDNFLQAVNDQIVTLNAAEQDQETEDPDLSDDEKIDILTKIQENLLEEYQKLS
QKERKWFILKELLLDANVELDLFSNRGRKASHPIAFGAVAIPTNVNANSLAFNRTKRRKI
NKNGLLENIL

.. but I'd like.. 

>UNIPROT:Q12345 Ino eighty subunit 3.
MKFEDLLATNKQVQFAHAATQHYKSVKTPDFLEKDPHHKKFHNADGLNQQGSSTPSTATD
ANAASTASTHTNTTTFKRHIVAVDDISKMNYEMIKNSPGNVITNANQDEIDISTLKTRLY
KDNLYAMNDNFLQAVNDQIVTLNAAEQDQETEDPDLSDDEKIDILTKIQENLLEEYQKLS
QKERKWFILKELLLDANVELDLFSNRGRKASHPIAFGAVAIPTNVNANSLAFNRTKRRKI
NKNGLLENIL

Is that possible? 


-- 
Jesper Krogh


From maoj at helix.nih.gov  Wed Dec  6 09:55:33 2006
From: maoj at helix.nih.gov (Jean Mao)
Date: Wed, 6 Dec 2006 09:55:33 -0500
Subject: [EMBOSS] Question regarding dbxflat entry number processed
Message-ID: <000a01c71946$94beb600$be4de780@CIT.NIH.GOV>

Hi, I am using dbxflat to index a database. I would like to find out how
many entries were processed. In the index file database.pxid, there is a
line :

Count      123456

which is very close to the number of entries in the database file but not
exactly the same. Is there a way to find out? Thank you very much.

Jean Mao

From ajb at ebi.ac.uk  Wed Dec  6 10:23:44 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 6 Dec 2006 15:23:44 -0000 (GMT)
Subject: [EMBOSS] Question regarding dbxflat entry number processed
In-Reply-To: <000a01c71946$94beb600$be4de780@CIT.NIH.GOV>
References: <000a01c71946$94beb600$be4de780@CIT.NIH.GOV>
Message-ID: <48753.81.98.244.247.1165418624.squirrel@webmail.ebi.ac.uk>

Hi Jean,

Usually you just need to add 1 to the 'Count' value as the counting
works from 0 to n-1 rather than 1 to n. However, if there are duplicate
keys in the database then that cannot be relied upon: the
Count is representative of the number of unique keys at the top level
of the tree and does not include any duplicates indexed in a subtree.
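
A toy model of the counting described above, offered only as an illustration. It does not read real dbxflat index files, the identifiers are invented, and it assumes the 'Count' value behaves like the zero-based index of the last unique key.

```python
def summarize_keys(ids):
    """Return (total entries, unique keys, zero-based index of last unique key)."""
    unique = set(ids)
    return len(ids), len(unique), len(unique) - 1


# Illustrative identifiers only; "B" appears twice.
total, unique, last_index = summarize_keys(["A", "B", "C", "B"])

# With a duplicate present, last_index + 1 undercounts the entries.
print(total, unique, last_index)
```

With no duplicates, the last index plus one equals the number of entries; with duplicates it only gives the number of unique keys.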

HTH

Alan


From JK at novozymes.com  Wed Dec  6 10:33:11 2006
From: JK at novozymes.com (JK (Jesper Agerbo Krogh))
Date: Wed, 6 Dec 2006 16:33:11 +0100
Subject: [EMBOSS] Output from seqret in fastaformat.
In-Reply-To: <4576C594.3080609@ebi.ac.uk>
Message-ID: <934F95E71B6C9347A873C42AE3C196191386B8DD@NZT0004E.dknz.nzcorp.net>

Hi,
> 
> Use -osdbname UNIPROT in the command line.

That sort of works... but that gives me the DATABASE:ID not the
DATABASE:AC in the fasta-header. 

What's the actual difference between the id and the accession numbers?

-- 
Jesper Krogh


From pmr at ebi.ac.uk  Wed Dec  6 12:30:58 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Wed, 6 Dec 2006 17:30:58 -0000 (GMT)
Subject: [EMBOSS] Output from seqret in fastaformat.
In-Reply-To: <934F95E71B6C9347A873C42AE3C196191386B8DD@NZT0004E.dknz.nzcorp.net>
References: <4576C594.3080609@ebi.ac.uk>
	<934F95E71B6C9347A873C42AE3C196191386B8DD@NZT0004E.dknz.nzcorp.net>
Message-ID: <14912.193.173.109.1.1165426258.squirrel@webmail.ebi.ac.uk>

Hi Jesper,

>> Use -osdbname UNIPROT in the command line.
>
> That sort of works... but that gives me the DATABASE:ID not the
> DATABASE:AC in the fasta-header.

Yup, you need to redefine the ID as well with -sid Q12345

> What's the actual difference between the id and the accession numbers?

The id is the identifier on the ID line of the entry.

The accession number is from the AC line - also a unique identifier but
completely unmemorable. Given the choice, we prefer the real ID. Entries
can also have more than one accession number (more common for EMBL entries
than for UniProt) where entries are merged or changed.

entret will show you the full entry so you can see where the identifiers
come from.
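
To make the ID versus accession distinction concrete, here is a small post-processing sketch. It is not an EMBOSS feature: the function name is hypothetical, and it assumes the header layout ">ID AC description" seen in the seqret output earlier in this thread.

```python
def header_with_accession(header, dbname):
    """Rewrite '>ID AC description' as '>DBNAME:AC description'."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Split into ID, accession and the remaining description text.
    fields = header[1:].split(None, 2)
    if len(fields) < 2:
        raise ValueError("expected at least an ID and an accession")
    accession = fields[1]
    description = fields[2] if len(fields) > 2 else ""
    return (">%s:%s %s" % (dbname, accession, description)).rstrip()


# Header taken from the seqret output quoted in the thread.
print(header_with_accession(">IES3_YEAST Q12345 Ino eighty subunit 3.", "UNIPROT"))
```

The qualifiers suggested in the thread (-osdbname and -sid) do this inside seqret itself; the sketch only shows which field of the header each identifier occupies.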

Hope that helps,

Peter


From mincloud at gmail.com  Thu Dec  7 15:36:03 2006
From: mincloud at gmail.com (yun zheng)
Date: Thu, 7 Dec 2006 14:36:03 -0600
Subject: [EMBOSS] how to find unique DNA sequences from a large database
Message-ID: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>

Hi,

Are there any tools for finding unique sequences in a large database? Many
thanks.

I need to find unique DNA sequences from a large database. A short piece is
given as follows.

>001
aaaagttgtgtgtgtatgacaggtt
>013
aacctgtcatacacacacaactttt
>289
gttgtgtgtgtatgacaggtt
>375
tgtgtgtatgacaggttgat
>319
tcaacctgtcatacacaca
>177
cgcagtgtgtgtatgacagg
>271
gtcctacctgtcatacacac
>020
aagacataatgtgtgtatgacag

All these seem to be the same sequence, since BLASTN gives very small
e-values for their alignments.

BLASTN 2.2.8 [Jan-05-2004]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= 001
         (25 letters)

Database: drought-clustered.fa
           410 sequences; 8877 total letters

Searching.done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

013                                                                    50   8e-11
001                                                                    50   8e-11
289                                                                    42   2e-08
375                                                                    34   5e-06
319                                                                    34   5e-06
177                                                                    32   2e-05
271                                                                    30   8e-05
020                                                                    28   3e-04

Best regards.

sincerely

Zheng, Yun

Department of Computer Science

Washington University in St Louis

Campus Box 1045

1 Brookings Drive, St Louis, MO 63130

From mthon at tamu.edu  Thu Dec  7 18:55:00 2006
From: mthon at tamu.edu (Michael Thon)
Date: Thu, 7 Dec 2006 17:55:00 -0600
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
Message-ID: <7F1F24A9-10FD-462D-BD63-349AD4538EB9@tamu.edu>

Hi Yun, you might try a clustering algorithm like blastclust (single
linkage clustering) or mcl (a.k.a tribe-mcl) or one of the others  
that exist.  I can't think of any EMBOSS apps that would solve this  
problem, but maybe someone else has a better answer.
Mike


On Dec 7, 2006, at 2:36 PM, yun zheng wrote:

> Hi,
>
> Are there any tools for finding unique sequences in a large database? Many
> thanks.
>
> I need to find unique DNA sequences from a large database. A short piece is
> given as follows.
>
>> 001
> aaaagttgtgtgtgtatgacaggtt
>> 013
> aacctgtcatacacacacaactttt
>> 289
> gttgtgtgtgtatgacaggtt
>> 375
> tgtgtgtatgacaggttgat
>> 319
> tcaacctgtcatacacaca
>> 177
> cgcagtgtgtgtatgacagg
>> 271
> gtcctacctgtcatacacac
>> 020
> aagacataatgtgtgtatgacag
>
> All these seem to be the same sequence, since BLASTN gives very small
> e-values for their alignments.
>
> BLASTN 2.2.8 [Jan-05-2004]
>
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs",  Nucleic Acids Res. 25:3389-3402.
>
> Query= 001
>          (25 letters)
>
> Database: drought-clustered.fa
>            410 sequences; 8877 total letters
>
> Searching.done
>
>                                                                  Score    E
> Sequences producing significant alignments:                      (bits) Value
>
> 013                                                                    50   8e-11
> 001                                                                    50   8e-11
> 289                                                                    42   2e-08
> 375                                                                    34   5e-06
> 319                                                                    34   5e-06
> 177                                                                    32   2e-05
> 271                                                                    30   8e-05
> 020                                                                    28   3e-04
>
> Best regards.
>
> sincerely
>
> Zheng, Yun
>
> Department of Computer Science
>
> Washington University in St Louis
>
> Campus Box 1045
>
> 1 Brookings Drive, St Louis, MO 63130


From ztu at msi.umn.edu  Thu Dec  7 20:30:38 2006
From: ztu at msi.umn.edu (Zheng Jin Tu)
Date: Thu, 7 Dec 2006 19:30:38 -0600 (CST)
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <7F1F24A9-10FD-462D-BD63-349AD4538EB9@tamu.edu>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
	<7F1F24A9-10FD-462D-BD63-349AD4538EB9@tamu.edu>
Message-ID: <Pine.LNX.4.61.0612071917590.21009@virga.msi.umn.edu>


Although these are not good ways to do it,
they are workable solutions:

First, for each sequence in your database, make
one long sequence string.  Then use a for loop to
scan over that long string with a window the
size of your search sequence.  Do this for
every sequence in the database.  It
may take a few days if you need to scan big
databases such as the human genome.
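
The window scan described above can be sketched as follows. This is an illustration only: the function name is hypothetical, matching is exact and on one strand, and the two sequences are entries 289 and 001 from the first message in this thread.

```python
def window_hits(query, subject):
    """Return 0-based start positions where query matches subject exactly."""
    q, s = query.lower(), subject.lower()
    width = len(q)
    # Slide a window of the query's length along the subject.
    return [i for i in range(len(s) - width + 1) if s[i:i + width] == q]


seq_289 = "gttgtgtgtgtatgacaggtt"
seq_001 = "aaaagttgtgtgtgtatgacaggtt"
print(window_hits(seq_289, seq_001))
```

A query longer than the subject simply yields an empty list, since the range of start positions is empty.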

The other way is to elongate your short query
to 17 or 21 nt (I am not sure which is the shortest
length that blast works with) so that blast can search it.
That means, if you have a 15 nt oligo, you can
create 4 x 4 = 16 possible 17 nt sequences,
such as:

   AAACCCGGGC CCTTTAAaa
   AAACCCGGGC CCTTTAAag
   AAACCCGGGC CCTTTAAac
   AAACCCGGGC CCTTTAAat
   AAACCCGGGC CCTTTAAga
   AAACCCGGGC CCTTTAAgg
   AAACCCGGGC CCTTTAAgc
   AAACCCGGGC CCTTTAAgt
   AAACCCGGGC CCTTTAAca
   AAACCCGGGC CCTTTAAct
   AAACCCGGGC CCTTTAAcg
   AAACCCGGGC CCTTTAAcc
   .....

Then you run blast and combine all results from
the 16 extended sequences as the hits for your
original query sequence.
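
Generating the extended queries can be sketched like this. The function name is hypothetical, and the base string is the one printed in the listing above with its space removed; running blast and merging the hits is left out.

```python
from itertools import product


def extend_query(query, extra=2, alphabet="acgt"):
    """Append every possible tail of `extra` bases to the query."""
    return [query + "".join(tail) for tail in product(alphabet, repeat=extra)]


# Base string copied from the listing above (space removed).
base = "AAACCCGGGCCCTTTAA"
extended = extend_query(base)
for seq in extended:
    print(seq)
```

Two extra bases over a four-letter alphabet give 4 x 4 = 16 candidates; each would be searched separately and the hits merged afterwards.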

Hope this is useful.


Thanks,  TU

==================================

On Thu, 7 Dec 2006, Michael Thon wrote:

> Hi Yun , you might try a clustering algorithm like blastclust (single
> linkage clustering) or mcl (a.k.a tribe-mcl) or one of the others
> that exist.  I can't think of any EMBOSS apps that would solve this
> problem, but maybe someone else has a better answer.
> Mike
>
>
> On Dec 7, 2006, at 2:36 PM, yun zheng wrote:
>
>> Hi,
>>
>> Are there any tools for finding unique sequences in a large database? Many
>> thanks.
>>
>> I need to find unique DNA sequences from a large database. A short piece is
>> given as follows.
>>
>>> 001
>> aaaagttgtgtgtgtatgacaggtt
>>> 013
>> aacctgtcatacacacacaactttt
>>> 289
>> gttgtgtgtgtatgacaggtt
>>> 375
>> tgtgtgtatgacaggttgat
>>> 319
>> tcaacctgtcatacacaca
>>> 177
>> cgcagtgtgtgtatgacagg
>>> 271
>> gtcctacctgtcatacacac
>>> 020
>> aagacataatgtgtgtatgacag
>>
>> All these seem to be the same sequence, since BLASTN gives very small
>> e-values for their alignments.
>>
>> BLASTN 2.2.8 [Jan-05-2004]
>>
>>
>> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
>> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
>> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
>> programs",  Nucleic Acids Res. 25:3389-3402.
>>
>> Query= 001
>>          (25 letters)
>>
>> Database: drought-clustered.fa
>>            410 sequences; 8877 total letters
>>
>> Searching.done
>>
>>                                                                  Score    E
>> Sequences producing significant alignments:                      (bits) Value
>>
>> 013                                                                    50   8e-11
>> 001                                                                    50   8e-11
>> 289                                                                    42   2e-08
>> 375                                                                    34   5e-06
>> 319                                                                    34   5e-06
>> 177                                                                    32   2e-05
>> 271                                                                    30   8e-05
>> 020                                                                    28   3e-04
>>
>> Best regards.
>>
>> sincerely
>>
>> Zheng, Yun
>>
>> Department of Computer Science
>>
>> Washington University in St Louis
>>
>> Campus Box 1045
>>
>> 1 Brookings Drive, St Louis, MO 63130


From pmr at ebi.ac.uk  Fri Dec  8 03:37:32 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Fri, 8 Dec 2006 08:37:32 -0000 (GMT)
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
Message-ID: <1636.217.44.134.240.1165567052.squirrel@webmail.ebi.ac.uk>

Dear Yun Zheng,

> Are there any tools for finding unique sequences in a large database? Many
> thanks.
>
> I need to find unique DNA sequences from a large database. A short piece is
> given as follows.
>

> All these seem to be the same sequence, since BLASTN gives very small
> e-values for their alignments.

Remember that BLASTN is a local alignment tool. The small e-values
indicate that some part of your 001 query sequence is similar to some part
of a sequence in the database.

You need to check what is matching in the alignments reported by BLASTN.
One useful test is whether the whole length of your query is matching to
any of the sequences in the database, also for DNA whether it is matching
in one or both directions (as sequences can have biologically significant
inverted repeats).

There are tools (not in EMBOSS) available for building non-redundant
databases - excluding sequences which are subsequences of others in the
database, or selecting one of a set of sequences that match closely over
their whole length. But you do have to decide what you mean by redundancy
and make sure that the methods you apply are appropriate.
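
As one possible reading of "redundancy", the sketch below drops any sequence that occurs exactly inside a longer (or equal-length) kept sequence on either strand. It is not an EMBOSS program, the function names are hypothetical, and exact containment is only one of several reasonable definitions. The four sequences are taken from the original question.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def contained_in(short, long_seq):
    """True if `short` occurs in `long_seq` on either strand."""
    s, l = short.lower(), long_seq.lower()
    return s in l or reverse_complement(s) in l


def drop_contained(records):
    """Keep sequences that are not contained in an already kept one."""
    kept = {}
    # Longest first, so containers are seen before their fragments.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        if not any(contained_in(seq, k) for k in kept.values()):
            kept[name] = seq
    return kept


records = {
    "001": "aaaagttgtgtgtgtatgacaggtt",
    "013": "aacctgtcatacacacacaactttt",
    "289": "gttgtgtgtgtatgacaggtt",
    "375": "tgtgtgtatgacaggttgat",
}
nonredundant = drop_contained(records)
print(sorted(nonredundant))
```

Here 013 is the reverse complement of 001 and 289 is a subsequence of 001, so both are dropped, while 375 overlaps 001 but extends past its end and is therefore kept under this strict definition.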

Hope that helps,

Peter Rice


From mincloud at gmail.com  Fri Dec  8 13:50:40 2006
From: mincloud at gmail.com (yun zheng)
Date: Fri, 8 Dec 2006 12:50:40 -0600
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <1636.217.44.134.240.1165567052.squirrel@webmail.ebi.ac.uk>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
	<1636.217.44.134.240.1165567052.squirrel@webmail.ebi.ac.uk>
Message-ID: <8f6eb9540612081050n2e9b745lb28b79eb9dffb82f@mail.gmail.com>

Dear All,

Many thanks for your reply.

Best regards.

sincerely
zheng, yun


On 12/8/06, pmr at ebi.ac.uk <pmr at ebi.ac.uk> wrote:
>
> Dear Yun Zheng,
>
> > Are there any tools for finding unique sequences in a large database? Many
> > thanks.
> >
> > I need to find unique DNA sequences from a large database. A short piece is
> > given as follows.
> >
>
> > All these seem to be the same sequence, since BLASTN gives very small
> > e-values for their alignments.
>
> Remember that BLASTN is a local alignment tool. The small e-values
> indicate that some part of your 001 query sequence is similar to some part
> of a sequence in the database.
>
> You need to check what is matching in the alignments reported by BLASTN.
> One useful test is whether the whole length of your query is matching to
> any of the sequences in the database, also for DNA whether it is matching
> in one or both directions (as sequences can have biologically significant
> inverted repeats).
>
> There are tools (not in EMBOSS) available for building non-redundant
> databases - excluding sequences which are subsequences of others in the
> database, or selecting one of a set of sequences that match closely over
> their whole length. But you do have to decide what you mean by redundancy
> and make sure that the methods you apply are appropriate.
>
> Hope that helps,
>
> Peter Rice
>
>

From bobwohlhueter at earthlink.net  Tue Dec 12 17:32:13 2006
From: bobwohlhueter at earthlink.net (Robert Wohlhueter)
Date: Tue, 12 Dec 2006 17:32:13 -0500
Subject: [EMBOSS] jemboss standalone installation problems on MacBook/Intel
Message-ID: <457F2DED.7040903@earthlink.net>

Trying to run emboss/jemboss as a standalone on MacBookPro/intel under
OS 10.4.  Downloaded and installed from fink.sourceforge.net (in fink's
preferred /sw/share/EMBOSS tree).  [In trying to circumvent the problems
described below, I also built, in the /usr/local tree, from source code
at emboss.sourceforge.net, an entirely separate installation.  I get
exactly the same set of errors with that installation.]

Facts as best I comprehend them:
1) the envvars, including CLASSPATH, JEMBOSS_HOME, EMBOSS_INSTALL, etc.,
are set as specified in $EMBOSS_INSTALL/jemboss/runJemboss.sh, and seem
correct to me.  jars, *.class, *.java, and the several executables that
comprise the emboss suite of applications, as near as I can tell are all
present.

2) java runtime is that bundled by Apple with OS 10.4, namely
   {summer:~}66 bobw$ java -version
   java version "1.5.0_06"
   Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-112)
   Java HotSpot(TM) Client VM (build 1.5.0_06-64, mixed mode, sharing)

3) As suggested in Emboss Administrator's Guide, testing the
installation with
   $ wossname -auto
works as expected, listing countless applications in various categories.

4) When I try to run `java $JEMBOSS_HOME/org/emboss/jemboss/Jemboss local &`
   {summer:~}68 bobw$ jemboss
   [1] 736
   {summer:~}69 bobw$ Exception in thread "main"
java.lang.NoClassDefFoundError:
       /sw/share/EMBOSS/jemboss/org/emboss/jemboss/Jemboss
I'll worry about this later.

5) The nub of the problem is when I run `java org.emboss.jemboss.Jemboss
local &` from JEMBOSS_HOME, I get the following messages:

{summer:/sw/share/EMBOSS/jemboss}60 bobw$ Exception in thread "Thread-2"
java.lang.NullPointerException
       at
org.emboss.jemboss.gui.BuildProgramMenu$1.construct(BuildProgramMenu.java:278)
       at org.emboss.jemboss.gui.SwingWorker$2.run(SwingWorker.java:127)
       at java.lang.Thread.run(Thread.java:613)

I'm not a java programmer, but when I look at source code in
BuildProgramMenu.java, it looks like the error arises in a routine which
is trying to construct a "dataFile" file specification from data in an
object called "mysettings".  I don't see how/where "mysettings" is
defined, but I'm suspicious that it is intended to read data from my
local settings (envvars, emboss.defaults ??), is not able to, and thus
passes null information to the new datafile specification.

Can anybody elucidate the source of data in "mysettings" and give me a
hint what I need to do to supply it?

Thanks for any and all pointers,

Bob Wohlhueter


From tjc at sanger.ac.uk  Wed Dec 13 02:11:18 2006
From: tjc at sanger.ac.uk (Tim Carver)
Date: Wed, 13 Dec 2006 07:11:18 +0000
Subject: [EMBOSS] jemboss standalone installation problems on
 MacBook/Intel
In-Reply-To: <457F2DED.7040903@earthlink.net>
Message-ID: <C1A55816.2C2D1%tjc@sanger.ac.uk>

Hi Robert

Jemboss has not been set up to work with fink. You do need to use the EMBOSS
download (including any patches) and install using the script as described
at:

http://emboss.sourceforge.net/Jemboss/install/standalone.html

Also make sure you have the latest java from:

http://www.apple.com/downloads/macosx/apple/j2se50release4intel.html

Regards
Tim Carver


On 12/12/06 22:32, "Robert Wohlhueter" <bobwohlhueter at earthlink.net> wrote:

> Trying to run emboss/jemboss as a standalone on MacBookPro/intel under
> OS 10.4.  Downloaded and installed from fink.sourceforge.net (in fink's
> preferred /sw/share/EMBOSS tree).  [In trying to circumvent the problems
> described below, I also built, in the /usr/local tree, from source code
> at emboss.sourceforge.net, an entirely separate installation.  I get
> exactly the same set of errors with that installation.]
> 
> Facts as best I comprehend them:
> 1) the envvars, including CLASSPATH, JEMBOSS_HOME, EMBOSS_INSTALL, etc.,
> are set as specified in $EMBOSS_INSTALL/jemboss/runJemboss.sh, and seem
> correct to me.  jars, *.class, *.java, and the several executables that
> comprise the emboss suite of applications, as near as I can tell are all
> present.
> 
> 2) java runtime is that bundled by Apple with OS 10.4, namely
>    {summer:~}66 bobw$ java -version
>    java version "1.5.0_06"
>    Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-112)
>    Java HotSpot(TM) Client VM (build 1.5.0_06-64, mixed mode, sharing)
> 
> 3) As suggested in Emboss Administrator's Guide, testing the
> installation with
>    $ wossname -auto
> works as expected, listing countless applications in various categories.
> 
> 4) When I try to run `java $JEMBOSS_HOME/org/emboss/jemboss/Jemboss local &`
>    {summer:~}68 bobw$ jemboss
>    [1] 736
>    {summer:~}69 bobw$ Exception in thread "main"
> java.lang.NoClassDefFoundError:
>        /sw/share/EMBOSS/jemboss/org/emboss/jemboss/Jemboss
> I'll worry about this later.
> 
> 5) The nub of the problem is when I run `java org.emboss.jemboss.Jemboss
> local &` from JEMBOSS_HOME, I get the following messages:
> 
> {summer:/sw/share/EMBOSS/jemboss}60 bobw$ Exception in thread "Thread-2"
> java.lang.NullPointerException
>        at
> org.emboss.jemboss.gui.BuildProgramMenu$1.construct(BuildProgramMenu.java:278)
>        at org.emboss.jemboss.gui.SwingWorker$2.run(SwingWorker.java:127)
>        at java.lang.Thread.run(Thread.java:613)
> 
> I'm not a java programmer, but when I look at source code in
> BuildProgramMenu.java, it looks like the error arises in a routine which
> is trying to construct a "dataFile" file specification from data in an
> object called "mysettings".  I don't see how/where "mysettings" is
> defined, but I'm suspicious that it is intended to read data from my
> local settings (envvars, emboss.defaults ??), is not able to, and thus
> passes null information to the new datafile specification.
> 
> Can anybody elucidate the source of data in "mysettings" and give me a
> hint what I need to do to supply it?
> 
> Thanks for any and all pointers,
> 
> Bob Wohlhueter
> 


From pandey.gaurav at gmail.com  Wed Dec 13 23:43:06 2006
From: pandey.gaurav at gmail.com (Gaurav Pandey)
Date: Wed, 13 Dec 2006 22:43:06 -0600
Subject: [EMBOSS] New Review: Computational Approaches for Protein Function
	Prediction
Message-ID: <627ca1900612132043l48a54760ibda9de67fe35a8f5@mail.gmail.com>

 [Apologies if you receive this more than once]

Dear Colleague,

We are pleased to share with you a recent review of several hundred papers
in the field of computational protein function prediction:

Title: Computational Approaches for Protein Function Prediction: A Survey

Authors: Gaurav Pandey <http://www.cs.umn.edu/%7Egaurav>,
Vipin Kumar <http://www.cs.umn.edu/%7Ekumar> and
Michael Steinbach <http://www.cs.umn.edu/%7Esteinbac>

Available at: http://www.cs.umn.edu/~kumar/papers/survey.php

Abstract

Proteins are the most essential and versatile macromolecules of life, and
the knowledge of their functions is a crucial link in the development of new
drugs, better crops, and even the development of synthetic biochemicals such
as biofuels. Experimental procedures for protein function prediction are
inherently low throughput and are thus unable to annotate a non-trivial
fraction of proteins that are becoming available due to rapid advances in
genome sequencing technology. This has motivated the development of
computational techniques that utilize a variety of high-throughput
experimental data for protein function prediction, such as protein and
genome sequences, gene expression data, protein interaction networks and
phylogenetic profiles. Indeed, in a short period of a decade, several
hundred articles have been published on this topic. This review aims to
discuss this wide spectrum of approaches
by categorizing them in terms of the data type they use for predicting
function, and thus identify the trends and needs of this very important
field. The survey is expected to be useful for computational biologists and
bioinformaticians aiming to get an overview of the field of computational
function prediction, and identify areas that can benefit from further
research.


Your comments on the article, or any part thereof, are welcome.

Thanks and best regards

Gaurav Pandey (gaurav at cs.umn.edu)
Vipin Kumar (kumar at cs.umn.edu)
Michael Steinbach (steinbac at cs.umn.edu)

From msarachu at biol.unlp.edu.ar  Wed Dec 27 09:01:37 2006
From: msarachu at biol.unlp.edu.ar (Martín Sarachu)
Date: Wed, 27 Dec 2006 11:01:37 -0300
Subject: [EMBOSS] wEMBOSS-1.7.1 released
Message-ID: <1167228097.45927cc185874@webmail.biol.unlp.edu.ar>

This release is mainly to fix a problem with editing nucList and protList and
also includes some minor changes.
wrappers4EMBOSS-1.5.1 is included in this wEMBOSS release.
wEMBOSS can be downloaded from http://www.wemboss.org

Best regards,

The wEMBOSS team

-- 
Martín Sarachu
msarachu at biol.unlp.edu.ar
EMBnet Argentina
http://www.ar.embnet.org

From hpm at bioinfo-user.org.uk  Sun Dec  3 15:06:25 2006
From: hpm at bioinfo-user.org.uk (Hamish McWilliam)
Date: Sun, 03 Dec 2006 15:06:25 +0000
Subject: [EMBOSS] EMBOSS database setup
In-Reply-To: <43964.81.98.244.247.1164904060.squirrel@webmail.ebi.ac.uk>
References: <639b80db0611281217i6c1cc927v50ac9b8e6a71717c@mail.gmail.com>	<1164902235.14146.57.camel@emboss2.ebi.ac.uk>
	<43964.81.98.244.247.1164904060.squirrel@webmail.ebi.ac.uk>
Message-ID: <4572E7F1.5080709@bioinfo-user.org.uk>

Hi Alan,

> The appended definitions are simple ones that may be
> useful if you only want a few sequences at a time.
> If sites upgrade to SRS8 then alter accordingly.
> 
> Alan
> 
> DB embl [  type: N method: srswww format: embl release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "EMBL from the EBI" ]
> 
> DB em [  type: N method: srswww format: embl release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   dbalias: "EMBL"
>   comment: "EMBL from the EBI" ]
> 
> DB swissprot [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "SWISSPROT from the EBI" ]
> 
> DB sw [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   dbalias: "SWISSPROT"
>   comment: "SWISSPROT from the EBI" ]
> 
> DB uniprot [  type: P method: srswww format: swiss release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "UNIPROT from the EBI" ]
> 
> DB uni [  type: P method: srswww format: swiss release: EBI
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   dbalias: "UNIPROT"
>   comment: "UNIPROT from the EBI" ]
> 
> DB pir [  type: P method: srswww format: nbrf release: "EBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "PIR from the EBI" ]
> 
> DB genbank [  type: N method: srswww format: genbank release: "NCBI"
>   url: "http://www.infobiogen.fr/srs7bin/cgi-bin/wgetz"
>   comment: "GenBank from Infobiogen" ]
> 
> DB gb [  type: N method: srswww format: genbank release: "NCBI"
>   url: "http://www.infobiogen.fr/srs7bin/cgi-bin/wgetz"
>   dbalias: "GENBANK"
>   comment: "GenBank from Infobiogen" ]
> 
> DB refseq [  type: N method: srswww format: genbank release: "NCBI"
>   url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
>   comment: "REFSEQ from EBI" ]

For the EBI's SRS server please use:

   http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz

as the URL. This should allow for continued support when the server is 
upgraded.
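
As a rough illustration of the URL change suggested above, the following
Python sketch replaces the version-specific EBI URL with the generic one in
the text of a definitions file. The function name and the sample text are
assumptions made for this example; only the two URLs come from this thread.

```python
# Version-specific EBI URL from the quoted definitions, and the
# version-independent URL recommended in the message above.
OLD_EBI_URL = "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
NEW_EBI_URL = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"


def use_generic_ebi_url(definitions: str) -> str:
    """Return the definitions text with the EBI SRS URL made generic.

    Only exact occurrences of the EBI srs7bin URL are replaced, so
    definitions that point at other sites are left untouched.
    """
    return definitions.replace(OLD_EBI_URL, NEW_EBI_URL)


# One of the quoted definitions, used here as sample input.
sample = '''DB embl [  type: N method: srswww format: embl release: "EBI"
  url: "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"
  comment: "EMBL from the EBI" ]
'''

print(use_generic_ebi_url(sample))
```

Definitions for non-EBI servers would still need to be checked by hand.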

Also note that the Infobiogen SRS service is no longer available. For 
other SRS sites carrying GenBank please see the Public SRS Server List 
(http://downloads.biowisdomsrs.com/publicsrs.html).

Hamish


From maoj at helix.nih.gov  Mon Dec  4 15:49:26 2006
From: maoj at helix.nih.gov (Jean Mao)
Date: Mon, 4 Dec 2006 10:49:26 -0500
Subject: [EMBOSS] Application for PFAM?
Message-ID: <000501c717bb$c6d353d0$be4de780@CIT.NIH.GOV>

Hi,
Just wondering if EMBOSS has any program that will search a Pfam database?


From David.Bauer at SCHERING.DE  Tue Dec  5 06:40:54 2006
From: David.Bauer at SCHERING.DE (David.Bauer at SCHERING.DE)
Date: Tue, 5 Dec 2006 07:40:54 +0100
Subject: [EMBOSS] Re: Application for PFAM?
In-Reply-To: <000501c717bb$c6d353d0$be4de780@CIT.NIH.GOV>
Message-ID: <OF66C1DF6D.0A27E6D3-ONC125723B.00244163-C125723B.0024B45A@schering.de>

Hi Jean,

Not in core EMBOSS, but there is the HMMER-2.3.2 EMBASSY package.
It includes the program ehmmpfam, which searches a sequence against the
Pfam HMM database.

HTH,
David.

emboss-bounces at lists.open-bio.org wrote on 04/12/2006 16:49:26:

> Hi,
> Just wondering if EMBOSS has any program that will search a pfam
> database?
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss


From JK at novozymes.com  Wed Dec  6 12:34:38 2006
From: JK at novozymes.com (JK (Jesper Agerbo Krogh))
Date: Wed, 6 Dec 2006 13:34:38 +0100
Subject: [EMBOSS] Output from seqret in fastaformat.
Message-ID: <934F95E71B6C9347A873C42AE3C196190B84C672@NZT0004E.dknz.nzcorp.net>


Hi,

I've got dbxflat to index the swissprot database, but I'd like to have the
output formatted with the USA as the FASTA ID.

Current:

seqret UNIPROT:Q12345
Reads and writes (returns) sequences
output sequence(s) [ies3_yeast.fasta]:

>IES3_YEAST Q12345 Ino eighty subunit 3.
MKFEDLLATNKQVQFAHAATQHYKSVKTPDFLEKDPHHKKFHNADGLNQQGSSTPSTATD
ANAASTASTHTNTTTFKRHIVAVDDISKMNYEMIKNSPGNVITNANQDEIDISTLKTRLY
KDNLYAMNDNFLQAVNDQIVTLNAAEQDQETEDPDLSDDEKIDILTKIQENLLEEYQKLS
QKERKWFILKELLLDANVELDLFSNRGRKASHPIAFGAVAIPTNVNANSLAFNRTKRRKI
NKNGLLENIL

...but I'd like:

>UNIPROT:Q12345 Ino eighty subunit 3.
MKFEDLLATNKQVQFAHAATQHYKSVKTPDFLEKDPHHKKFHNADGLNQQGSSTPSTATD
ANAASTASTHTNTTTFKRHIVAVDDISKMNYEMIKNSPGNVITNANQDEIDISTLKTRLY
KDNLYAMNDNFLQAVNDQIVTLNAAEQDQETEDPDLSDDEKIDILTKIQENLLEEYQKLS
QKERKWFILKELLLDANVELDLFSNRGRKASHPIAFGAVAIPTNVNANSLAFNRTKRRKI
NKNGLLENIL

Is that possible? 


-- 
Jesper Krogh


From maoj at helix.nih.gov  Wed Dec  6 14:55:33 2006
From: maoj at helix.nih.gov (Jean Mao)
Date: Wed, 6 Dec 2006 09:55:33 -0500
Subject: [EMBOSS] Question regarding dbxflat entry number processed
Message-ID: <000a01c71946$94beb600$be4de780@CIT.NIH.GOV>

Hi, I am using dbxflat to index a database. I would like to find out how
many entries were processed. In the index file database.pxid, there is a
line:

Count      123456

which is very close to the number of entries in the database file but not
exactly the same. Is there a way to find out the exact number? Thank you
very much.

Jean Mao


From ajb at ebi.ac.uk  Wed Dec  6 15:23:44 2006
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 6 Dec 2006 15:23:44 -0000 (GMT)
Subject: [EMBOSS] Question regarding dbxflat entry number processed
In-Reply-To: <000a01c71946$94beb600$be4de780@CIT.NIH.GOV>
References: <000a01c71946$94beb600$be4de780@CIT.NIH.GOV>
Message-ID: <48753.81.98.244.247.1165418624.squirrel@webmail.ebi.ac.uk>

Hi Jean,

Usually you just need to add 1 to the 'Count' value as the counting
works from 0 to n-1 rather than 1 to n. However, if there are duplicate
keys in the database then that cannot be relied upon: the
Count is representative of the number of unique keys at the top level
of the tree and does not include any duplicates indexed in a subtree.
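
One way to cross-check the figure is to count entries in the flat file
directly. The Python sketch below is only an illustration and is not part of
dbxflat: it assumes an EMBL/Swiss-Prot style flat file in which every entry
starts with an ID line, and the function names, toy data and file name are
all made up for the example.

```python
from typing import Iterable


def count_entries(lines: Iterable[str]) -> int:
    """Count entries by counting lines that begin with the 'ID' line code."""
    return sum(1 for line in lines if line.startswith("ID   "))


def count_unique_ids(lines: Iterable[str]) -> int:
    """Count distinct entry names (second field of each ID line)."""
    names = set()
    for line in lines:
        if line.startswith("ID   "):
            fields = line.split()
            if len(fields) > 1:
                names.add(fields[1])
    return len(names)


# Toy input: three entries, two of which share the same made-up name.
toy = [
    "ID   AAA_TEST   toy entry\n", "//\n",
    "ID   BBB_TEST   toy entry\n", "//\n",
    "ID   AAA_TEST   toy entry\n", "//\n",
]
print(count_entries(toy), count_unique_ids(toy))

# For a real file (hypothetical name):
# with open("mydatabase.dat") as handle:
#     print(count_entries(handle))
```

A difference between the two numbers would point to duplicate names, which
is the case where the Count value cannot be relied upon.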

HTH

Alan


From JK at novozymes.com  Wed Dec  6 15:33:11 2006
From: JK at novozymes.com (JK (Jesper Agerbo Krogh))
Date: Wed, 6 Dec 2006 16:33:11 +0100
Subject: [EMBOSS] Output from seqret in fastaformat.
In-Reply-To: <4576C594.3080609@ebi.ac.uk>
Message-ID: <934F95E71B6C9347A873C42AE3C196191386B8DD@NZT0004E.dknz.nzcorp.net>

Hi,
> 
> Use -osdbname UNIPROT in the command line.

That sort of works, but it gives me DATABASE:ID rather than
DATABASE:AC in the FASTA header.

What's the actual difference between the ID and the accession number?

-- 
Jesper Krogh


From pmr at ebi.ac.uk  Wed Dec  6 17:30:58 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Wed, 6 Dec 2006 17:30:58 -0000 (GMT)
Subject: [EMBOSS] Output from seqret in fastaformat.
In-Reply-To: <934F95E71B6C9347A873C42AE3C196191386B8DD@NZT0004E.dknz.nzcorp.net>
References: <4576C594.3080609@ebi.ac.uk>
	<934F95E71B6C9347A873C42AE3C196191386B8DD@NZT0004E.dknz.nzcorp.net>
Message-ID: <14912.193.173.109.1.1165426258.squirrel@webmail.ebi.ac.uk>

Hi Jesper,

>> Use -osdbname UNIPROT in the command line.
>
> That sort of works... but that gives me the DATABASE:ID not the
> DATABASE:AC in the fasta-header.

Yup, you need to redefine the ID as well with -sid Q12345

> Whats the actual difference between the id and the accessionnumbers?

The id is the identifier on the ID line of the entry.

The accession number is from the AC line - also a unique identifier but
completely unmemorable. Given the choice, we prefer the real ID. Entries
can also have more than one accession number (more common for EMBL entries
than for UniProt) where entries are merged or changed.

entret will show you the full entry so you can see where the identifiers
come from.
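
To make the distinction concrete, here is a small Python sketch that pulls
the entry name from the ID line and the accession numbers from the AC line
of a flat-file entry, then builds a DATABASE:AC style header. It is an
illustration only, not how seqret works internally; the function names are
invented, and the entry is abridged to the identifiers quoted in this
thread.

```python
from typing import List, Tuple


def parse_id_and_accessions(entry: str) -> Tuple[str, List[str]]:
    """Return (entry name from the ID line, accessions from the AC lines)."""
    entry_id = ""
    accessions: List[str] = []
    for line in entry.splitlines():
        if line.startswith("ID   ") and not entry_id:
            entry_id = line.split()[1]
        elif line.startswith("AC   "):
            # An AC line holds one or more accessions separated by ';'.
            accessions.extend(
                acc.strip() for acc in line[5:].split(";") if acc.strip()
            )
    return entry_id, accessions


def fasta_header(dbname: str, entry: str, description: str) -> str:
    """Build a '>DATABASE:AC description' header from the first accession."""
    _, accessions = parse_id_and_accessions(entry)
    return f">{dbname}:{accessions[0]} {description}"


# Abridged entry: the rest of the ID line and all other lines are omitted.
entry = "ID   IES3_YEAST\nAC   Q12345;\nDE   Ino eighty subunit 3.\n"
print(parse_id_and_accessions(entry))
print(fasta_header("UNIPROT", entry, "Ino eighty subunit 3."))
```

Where an entry carries several accession numbers, this sketch simply takes
the first one listed.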

Hope that helps,

Peter


From mincloud at gmail.com  Thu Dec  7 20:36:03 2006
From: mincloud at gmail.com (yun zheng)
Date: Thu, 7 Dec 2006 14:36:03 -0600
Subject: [EMBOSS] how to find unique DNA sequences from a large database
Message-ID: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>

Hi,

Are there any tools for finding unique sequences in a large database? Many
thanks.

I need to find unique DNA sequences in a large database. A short extract is
given below.

>001
aaaagttgtgtgtgtatgacaggtt
>013
aacctgtcatacacacacaactttt
>289
gttgtgtgtgtatgacaggtt
>375
tgtgtgtatgacaggttgat
>319
tcaacctgtcatacacaca
>177
cgcagtgtgtgtatgacagg
>271
gtcctacctgtcatacacac
>020
aagacataatgtgtgtatgacag

All these seem to be the same sequence, since BLASTN gives very small
e-values for their alignments.

BLASTN 2.2.8 [Jan-05-2004]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= 001
         (25 letters)

Database: drought-clustered.fa
           410 sequences; 8877 total letters

Searching.done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

013                                                                    50   8e-11
001                                                                    50   8e-11
289                                                                    42   2e-08
375                                                                    34   5e-06
319                                                                    34   5e-06
177                                                                    32   2e-05
271                                                                    30   8e-05
020                                                                    28   3e-04

Best regards.

sincerely

Zheng, Yun

Department of Computer Science

Washington University in St Louis

Campus Box 1045

1 Brookings Drive, St Louis, MO 63130


From mthon at tamu.edu  Thu Dec  7 23:55:00 2006
From: mthon at tamu.edu (Michael Thon)
Date: Thu, 7 Dec 2006 17:55:00 -0600
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
Message-ID: <7F1F24A9-10FD-462D-BD63-349AD4538EB9@tamu.edu>

Hi Yun, you might try a clustering algorithm such as blastclust (single
linkage clustering), mcl (a.k.a. tribe-mcl) or one of the others
that exist.  I can't think of any EMBOSS apps that would solve this
problem, but maybe someone else has a better answer.
Mike




From ztu at msi.umn.edu  Fri Dec  8 01:30:38 2006
From: ztu at msi.umn.edu (Zheng Jin Tu)
Date: Thu, 7 Dec 2006 19:30:38 -0600 (CST)
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <7F1F24A9-10FD-462D-BD63-349AD4538EB9@tamu.edu>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
	<7F1F24A9-10FD-462D-BD63-349AD4538EB9@tamu.edu>
Message-ID: <Pine.LNX.4.61.0612071917590.21009@virga.msi.umn.edu>


These are not elegant approaches, but they are workable:

First, read each sequence in your database into one long string.
Then use a for loop to scan along that string with a window the
size of your search sequence.  Repeat this for every sequence in
the database.  It may take a few days if you need to scan a big
database such as the human genome.
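
The window scan described above could look roughly like this in Python.
It is a minimal sketch that finds exact matches only; the function name
is invented for the example, and the test sequence is sequence 001 from
the original question.

```python
from typing import List


def window_matches(sequence: str, query: str) -> List[int]:
    """Return 0-based start positions where query matches sequence exactly."""
    width = len(query)
    if width == 0:
        return []
    sequence = sequence.lower()
    query = query.lower()
    hits = []
    # Slide a window of the query's length along the sequence.
    for start in range(len(sequence) - width + 1):
        if sequence[start:start + width] == query:
            hits.append(start)
    return hits


print(window_matches("aaaagttgtgtgtgtatgacaggtt", "gacagg"))
```

This checks one strand only; the reverse complement of the query would
need a second pass.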

The other way is to elongate your short query to 17 or 21 nt
(I am not sure which is the shortest length BLAST will accept)
so that BLAST can search with it.  That is, if you have a 15 nt
oligo, you can create the 4 x 4 = 16 possible 17 nt sequences,
such as:

   AAACCCGGGC CCTTTAAaa
   AAACCCGGGC CCTTTAAag
   AAACCCGGGC CCTTTAAac
   AAACCCGGGC CCTTTAAat
   AAACCCGGGC CCTTTAAga
   AAACCCGGGC CCTTTAAgg
   AAACCCGGGC CCTTTAAgc
   AAACCCGGGC CCTTTAAgt
   AAACCCGGGC CCTTTAAca
   AAACCCGGGC CCTTTAAct
   AAACCCGGGC CCTTTAAcg
   AAACCCGGGC CCTTTAAcc
   .....

Then run BLAST and combine the results from all 16
17-nt sequences as the hits for your 15 nt
query sequence.
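
Generating the extended queries is easy to script. The Python sketch
below appends every possible two-base tail to a query; the function
name and the 15 nt oligo are made up for illustration.

```python
from itertools import product
from typing import List


def extend_query(query: str, extra: int = 2,
                 alphabet: str = "acgt") -> List[str]:
    """Return every version of query with `extra` bases appended."""
    return [query + "".join(tail)
            for tail in product(alphabet, repeat=extra)]


# Made-up 15 nt oligo, used purely for illustration.
oligo = "AAACCCGGGCCCTTT"
extended = extend_query(oligo)
print(len(extended), extended[:4])
```

The appended bases are written in lower case so that they are easy to
tell apart from the original query when the hits are combined.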

Hope this is useful.


Thanks,  TU



From pmr at ebi.ac.uk  Fri Dec  8 08:37:32 2006
From: pmr at ebi.ac.uk (pmr at ebi.ac.uk)
Date: Fri, 8 Dec 2006 08:37:32 -0000 (GMT)
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
Message-ID: <1636.217.44.134.240.1165567052.squirrel@webmail.ebi.ac.uk>

Dear Yun Zheng,

> Are there any tools for find unique sequences from a large database? Many
> thanks.
>
> I need to find unique DNA sequences from a large database. A short piece
> is
> given as follows.
>

> All these seem to be the same sequence, since BLASTN gives very small
> e-values for their alignments.

Remember that BLASTN is a local alignment tool. The small e-values
indicate that some part of your 001 query sequence is similar to some part
of a sequence in the database.

You need to check what is matching in the alignments reported by BLASTN.
One useful test is whether the whole length of your query matches any of
the sequences in the database and, for DNA, whether it matches in one or
both directions (as sequences can have biologically significant inverted
repeats).

There are tools (not in EMBOSS) available for building non-redundant
databases - excluding sequences which are subsequences of others in the
database, or selecting one of a set of sequences that match closely over
their whole length. But you do have to decide what you mean by redundancy
and make sure that the methods you apply are appropriate.
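
As one possible definition of redundancy, the Python sketch below drops any
sequence that is an exact subsequence of a sequence already kept, on either
strand. It is only an illustration of that single definition, not one of the
tools mentioned above; the function names are invented, and the input is
four of the sequences from the original question.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_contained(seqs: Dict[str, str]) -> List[str]:
    """Return the names of sequences not contained in an already kept one.

    A sequence is treated as redundant if it, or its reverse complement,
    is an exact substring of a kept sequence.
    """
    kept: List[str] = []
    # Longest first; the sort is stable, so ties keep their input order.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        seq = seqs[name].lower()
        rc = reverse_complement(seq)
        if any(seq in seqs[k].lower() or rc in seqs[k].lower() for k in kept):
            continue
        kept.append(name)
    return kept


sample = {
    "001": "aaaagttgtgtgtgtatgacaggtt",
    "013": "aacctgtcatacacacacaactttt",
    "289": "gttgtgtgtgtatgacaggtt",
    "375": "tgtgtgtatgacaggttgat",
}
print(drop_contained(sample))
```

Sequences that merely overlap, or that differ by a mismatch, are kept by
this definition; a looser one would need an alignment-based comparison.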

Hope that helps,

Peter Rice


From mincloud at gmail.com  Fri Dec  8 18:50:40 2006
From: mincloud at gmail.com (yun zheng)
Date: Fri, 8 Dec 2006 12:50:40 -0600
Subject: [EMBOSS] how to find unique DNA sequences from a large database
In-Reply-To: <1636.217.44.134.240.1165567052.squirrel@webmail.ebi.ac.uk>
References: <8f6eb9540612071236i27bf5d28k1e921d220ea0d9b5@mail.gmail.com>
	<1636.217.44.134.240.1165567052.squirrel@webmail.ebi.ac.uk>
Message-ID: <8f6eb9540612081050n2e9b745lb28b79eb9dffb82f@mail.gmail.com>

Dear All,

Many thanks for your reply.

Best regards.

sincerely
zheng, yun




From bobwohlhueter at earthlink.net  Tue Dec 12 22:32:13 2006
From: bobwohlhueter at earthlink.net (Robert Wohlhueter)
Date: Tue, 12 Dec 2006 17:32:13 -0500
Subject: [EMBOSS] jemboss standalone installation problems on MacBook/Intel
Message-ID: <457F2DED.7040903@earthlink.net>

I am trying to run emboss/jemboss as a standalone on MacBookPro/intel under
OS 10.4.  Downloaded and installed from fink.sourceforge.net (in fink's
preferred /sw/share/EMBOSS tree).  [In trying to circumvent the problems
described below, I also built, in the /usr/local tree, from source code
at emboss.sourceforge.net, an entirely separate installation.  I get
exactly the same set of errors with that installation.]

Facts as best I comprehend them:
1) the envvars, including CLASSPATH, JEMBOSS_HOME, EMBOSS_INSTALL, etc.,
are set as specified in $EMBOSS_INSTALL/jemboss/runJemboss.sh, and seem
correct to me.  The jars, *.class, *.java, and the several executables that
comprise the emboss suite of applications are, as near as I can tell, all
present.

2) java runtime is that bundled by Apple with OS 10.4, namely
   {summer:~}66 bobw$ java -version
   java version "1.5.0_06"
   Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-112)
   Java HotSpot(TM) Client VM (build 1.5.0_06-64, mixed mode, sharing)

3) As suggested in the EMBOSS Administrator's Guide, testing the
installation with
   $ wossname -auto
works as expected, listing countless applications in various categories.

4) When I try to run `java $JEMBOSS_HOME/org/emboss/jemboss/Jemboss local &`
   {summer:~}68 bobw$ jemboss
   [1] 736
   {summer:~}69 bobw$ Exception in thread "main" java.lang.NoClassDefFoundError:
       /sw/share/EMBOSS/jemboss/org/emboss/jemboss/Jemboss
I'll worry about this later.

5) The nub of the problem is when I run `java org.emboss.jemboss.Jemboss
local &` from JEMBOSS_HOME, I get the following messages:

{summer:/sw/share/EMBOSS/jemboss}60 bobw$ Exception in thread "Thread-2" java.lang.NullPointerException
       at org.emboss.jemboss.gui.BuildProgramMenu$1.construct(BuildProgramMenu.java:278)
       at org.emboss.jemboss.gui.SwingWorker$2.run(SwingWorker.java:127)
       at java.lang.Thread.run(Thread.java:613)

I'm not a Java programmer, but when I look at the source code in
BuildProgramMenu.java, it looks like the error arises in a routine which
is trying to construct a "dataFile" file specification from data in an
object called "mysettings".  I don't see how/where "mysettings" is
defined, but I suspect that it is intended to read data from my
local settings (envvars, emboss.defaults ??), is not able to, and thus
passes null information to the new datafile specification.

Can anybody elucidate the source of data in "mysettings" and give me a
hint what I need to do to supply it?

Thanks for any and all pointers,

Bob Wohlhueter


From tjc at sanger.ac.uk  Wed Dec 13 07:11:18 2006
From: tjc at sanger.ac.uk (Tim Carver)
Date: Wed, 13 Dec 2006 07:11:18 +0000
Subject: [EMBOSS] jemboss standalone installation problems on
 MacBook/Intel
In-Reply-To: <457F2DED.7040903@earthlink.net>
Message-ID: <C1A55816.2C2D1%tjc@sanger.ac.uk>

Hi Robert

Jemboss has not been set up to work with fink. You do need to use the EMBOSS
download (including any patches) and install using the script as described
at:

http://emboss.sourceforge.net/Jemboss/install/standalone.html

Also make sure you have the latest java from:

http://www.apple.com/downloads/macosx/apple/j2se50release4intel.html

Regards
Tim Carver




From pandey.gaurav at gmail.com  Thu Dec 14 04:43:06 2006
From: pandey.gaurav at gmail.com (Gaurav Pandey)
Date: Wed, 13 Dec 2006 22:43:06 -0600
Subject: [EMBOSS] New Review: Computational Approaches for Protein Function
	Prediction
Message-ID: <627ca1900612132043l48a54760ibda9de67fe35a8f5@mail.gmail.com>

 [Apologies if you receive this more than once]

Dear Colleague,

We are pleased to share with you a recent review of several hundred papers
in the field of computational protein function prediction:

Title: Computational Approaches for Protein Function Prediction: A Survey

Authors: Gaurav Pandey <http://www.cs.umn.edu/%7Egaurav>,
Vipin Kumar <http://www.cs.umn.edu/%7Ekumar> and
Michael Steinbach <http://www.cs.umn.edu/%7Esteinbac>

Available at: http://www.cs.umn.edu/~kumar/papers/survey.php

Abstract

Proteins are the most essential and versatile macromolecules of life, and
the knowledge of their functions is a crucial link in the development of new
drugs, better crops, and even the development of synthetic biochemicals such
as biofuels. Experimental procedures for protein function prediction are
inherently low throughput and are thus unable to annotate a non-trivial
fraction of proteins that are becoming available due to rapid advances in
genome sequencing technology. This has motivated the development of
computational techniques that utilize a variety of high-throughput
experimental data for protein function prediction, such as protein and
genome sequences, gene expression data, protein interaction networks and
phylogenetic profiles. Indeed, in a short period of a decade, several
hundred articles have been published on this topic. This review aims to
discuss this wide spectrum of approaches
by categorizing them in terms of the data type they use for predicting
function, and thus identify the trends and needs of this very important
field. The survey is expected to be useful for computational biologists and
bioinformaticians aiming to get an overview of the field of computational
function prediction, and identify areas that can benefit from further
research.


Your comments on the article, or any part thereof, are welcome.

Thanks and best regards

Gaurav Pandey (gaurav at cs.umn.edu)
Vipin Kumar (kumar at cs.umn.edu)
Michael Steinbach (steinbac at cs.umn.edu)


From msarachu at biol.unlp.edu.ar  Wed Dec 27 14:01:37 2006
From: msarachu at biol.unlp.edu.ar (Martín Sarachu)
Date: Wed, 27 Dec 2006 11:01:37 -0300
Subject: [EMBOSS] wEMBOSS-1.7.1 released
Message-ID: <1167228097.45927cc185874@webmail.biol.unlp.edu.ar>

This release is mainly to fix a problem with editing nucList and protList and
also includes some minor changes.
wrappers4EMBOSS-1.5.1 is included in this wEMBOSS release.
wEMBOSS can be downloaded from http://www.wemboss.org

Best regards,

The wEMBOSS team

-- 
Martín Sarachu
msarachu at biol.unlp.edu.ar
EMBnet Argentina
http://www.ar.embnet.org