From stephen.taylor at imm.ox.ac.uk  Thu Sep 10 09:37:18 2009
From: stephen.taylor at imm.ox.ac.uk (Stephen Taylor)
Date: Thu, 10 Sep 2009 14:37:18 +0100
Subject: [EMBOSS] PWMs in EMBOSS
Message-ID: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk>

Hi,

Is it possible to search PWMs in the following format in EMBOSS?

Thanks,

Steve


Name

A:	0.157272762655186	0.101522869320827	0.193269908676739	 
0.400388333956085	0.0242258313143186	0.00799984439409978	 
0.985140882956166	0.982735821742488	0.00822519796024231	 
0.00216973239819103	0.915564050240553	0.620804636015513	 
0.170770180463701	0.288450091623397	0.357825784599729	 
0.234712937095288	0.12172997792596
C:	0.585688175787352	0.237479093836628	0.435987353829905	 
0.147300012153726	0.342087245546403	0.0758172877627498	 
0.00580600999936562	0.00109267794415073	0.00794630235311888	 
0.00688337464627776	0.000618817602597609	0.0128822871237646	 
0.230733571623305	0.276650970023954	0.233013950290859	 
0.285797488742033	0.426544489294796
G:	0.146549771679813	0.564379933620146	0.162348523327608	 
0.288303072582316	0.0128822871237646	0.000618817602597609	 
0.00688337464627776	0.00794630235311888	0.00109267794415073	 
0.00580600999936562	0.0758172877627498	0.342087245546403	 
0.181573332311972	0.131046633924716	0.187725054521253	 
0.247196691859296	0.213778652608617
T:	0.110489289877649	0.0966181032223992	0.208394214165748	 
0.164008581307874	0.620804636015513	0.915564050240553	 
0.00216973239819103	0.00822519796024231	0.982735821742488	 
0.985140882956166	0.00799984439409978	0.0242258313143186	 
0.416922915601023	0.303852304427932	0.221435210588158	 
0.232292882303383	0.237946880170627
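
A minimal sketch of how the layout above could be read, written in Python rather than as EMBOSS code. It assumes that each row starts with a label `A:`, `C:`, `G:` or `T:`, that the wrapped lines are continuations of the preceding row, and that the first non-empty line is the matrix name. The function name `parse_pwm` is hypothetical and not part of any EMBOSS API.

```python
def parse_pwm(text):
    """Parse a matrix with 'A:', 'C:', 'G:' and 'T:' rows of floats.

    A row may wrap over several lines; it continues until the next label.
    Returns (name, rows) where rows maps each base to its list of values.
    """
    name = None
    rows = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        first = tokens[0]
        if len(first) == 2 and first[1] == ":" and first[0] in "ACGT":
            # A new row label starts here; the rest of the line holds values.
            current = first[0]
            rows[current] = []
            tokens = tokens[1:]
        elif current is None:
            # Text before the first row is taken to be the matrix name.
            name = line.strip()
            continue
        rows[current].extend(float(token) for token in tokens)
    lengths = {len(values) for values in rows.values()}
    if set(rows) != set("ACGT") or len(lengths) != 1:
        raise ValueError("expected four rows of equal length")
    return name, rows
```

The first column of the posted matrix sums to 1 across the four bases, which suggests the values are per-position probabilities rather than log-odds scores; a reader built this way could check that for every column before searching with the matrix.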


From pmr at ebi.ac.uk  Thu Sep 10 10:54:25 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 10 Sep 2009 15:54:25 +0100
Subject: [EMBOSS] PWMs in EMBOSS
In-Reply-To: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk>
References: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk>
Message-ID: <4AA91321.1020002@ebi.ac.uk>

Stephen Taylor wrote:
> Hi,
> 
> Is it possible to search PWMs in the following format in EMBOSS?

Not yet ... but we can extend the formats for PWMs. We only support the
formats that we write.

Votes please on position weight matrix formats EMBOSS should be able to
read ...

regards,

Peter

From charles-listes-emboss at plessy.org  Wed Sep 16 02:57:55 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Wed, 16 Sep 2009 15:57:55 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
Message-ID: <20090916065755.GA15425@kunpuu.plessy.org>

Dear EMBOSS developers,

I have a multi-sequence file in FASTQ format that contains sequencing reads, and
would like to retrieve them with seqret. But as you see in the following
example, quality scores are not preserved:

$ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
Reads and writes (returns) sequences
@F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
+
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

The purpose was to use seqret as a workaround for the fact that vectorstrip
does not keep the quality either.

Do you think that it would be possible to get this functionality as a patch in
the future, or is it big work that needs to wait for the next release?

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan

From uludag at ebi.ac.uk  Wed Sep 16 05:12:16 2009
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Wed, 16 Sep 2009 10:12:16 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
Message-ID: <1253092337.32439.31.camel@emboss2.ebi.ac.uk>

Hi Charles,

seqret returns quality scores if the input sequence format is explicitly
defined on the command line, such as -sformat=fastq-sanger.

The following patch looks like it fixes the vectorstrip problem.


*** ajseq.c.org	2009-09-16 10:08:17.000000000 +0100
--- ajseq.c	2009-09-16 09:52:56.000000000 +0100
***************
*** 781,786 ****
--- 781,792 ----
  
      if (seq->Fttable)
  	pthis->Fttable = ajFeattableCopy(seq->Fttable);
+     
+     if (seq->Accuracy)
+     {
+     	AJCNEW0(pthis->Accuracy,seq->Seq->Len);
+     	memmove(pthis->Accuracy,seq->Accuracy,seq->Seq->Len*sizeof(float));
+     }
  
      return pthis;
  }


Regards,
Mahmut


> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you see in the following
> example, quality scores are not preserved:
> 
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> 
> The purpose was to use seqret as a workaround for the fact that vectorstrip
> does not keep the quality either.
> 
> Do you think that it would be possible to get this functionality as a patch in
> the future, or is it big work that needs to wait for the next release?


From biopython at maubp.freeserve.co.uk  Wed Sep 16 05:31:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 10:31:22 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
Message-ID: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>

On Wed, Sep 16, 2009 at 7:57 AM, Charles Plessy
<charles-listes-emboss at plessy.org> wrote:
>
> Dear EMBOSS developers,
>
> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you see in the following
> example, quality scores are not preserved:
>
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

You need to use "fastq-sanger" (or the other variants), since in
EMBOSS, "fastq" currently means FASTQ ignoring the qualities.
This is documented:

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

As an EMBOSS user, I think the current situation is confusing, and
it would make much more sense to have "fastq" just an alias for
"fastq-sanger" (which would be consistent with Biopython and BioPerl).

http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html

And also this email - especially the last example:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html

> The purpose was to use seqret as a workaround for the fact that
> vectorstrip does not keep the quality either.

That's also been suggested, and is likely to be supported in future.
http://lists.open-bio.org/pipermail/emboss/2009-August/003722.html

Peter

From charles-listes-emboss at plessy.org  Thu Sep 17 02:36:59 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Thu, 17 Sep 2009 15:36:59 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org>
	<320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
Message-ID: <20090917063659.GA27021@kunpuu.plessy.org>

On Wed, Sep 16, 2009 at 10:12:16AM +0100, Mahmut Uludag wrote:
> 
> seqret returns quality scores if the input sequence format is explicitly
> defined on the command line, such as -sformat=fastq-sanger.
> 
> The following patch looks like it fixes the vectorstrip problem.


On Wed, Sep 16, 2009 at 10:31:22AM +0100, Peter wrote:
> 
> You need to use "fastq-sanger" (or the other variants), since in
> EMBOSS, "fastq" currently means FASTQ ignoring the qualities.


Hi Mahmut and Peter, and thank you very much for your answers!

I would also like it if the qualities were kept by default. I actually had tried
to force the fastq-sanger format before, but by adding its name to the USAs,
as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
not work; I do not know if it is by design or because of the dash in the format
name. Nevertheless -sformat=fastq-sanger and -osformat=fastq-sanger worked very
well after I applied Mahmut's patch.

I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
premature. In particular, I get the following warning each time the quality
is encoded by an equal sign:

  Warning: Illegal character '='
  Warning: Illegal pattern: =

By the way, I think I found a bug in revseq: it seems that it does not reverse
the qualities:

  $ echo -e "@toto\nACTG\n+toto\n12/3" | seqret -filter -sformat=fastq-sanger -osformat=fastq-sanger
  @toto
  ACTG
  +
  12/3
  
  $ echo -e "@toto\nACTG\n+toto\n12/3" | revseq -filter -sformat=fastq-sanger -osformat=fastq-sanger
  @toto Reversed:
  CAGT
  +
  12/3
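
For comparison, a short Python sketch of the behaviour one would expect from a reverse-complement operation that keeps FASTQ qualities: the quality string is reversed along with the sequence so that each score stays attached to its base. This is an illustration of the expected result, not the revseq implementation; the function name `reverse_complement_fastq` is hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_fastq(sequence, quality):
    """Return the reverse complement and the reversed quality string."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # Complement each base, then reverse; qualities are only reversed.
    return sequence.translate(_COMPLEMENT)[::-1], quality[::-1]
```

With the record used above, `reverse_complement_fastq("ACTG", "12/3")` returns `("CAGT", "3/21")`, whereas the revseq output shows the quality line unchanged as `12/3`.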

Also, contrary to what the documentation predicts, using the fastq format
for the output does not ignore the quality scores. (Not that it would be
particularly useful, but...)

  $ echo -e "@toto\nACTG\n+toto\nACTG" | revseq -filter -sformat=fastq-sanger -osformat=fastq
  @toto Reversed:
  CAGT
  +
  ACTG


Have a nice day,

-- 
Charles Plessy
http://charles.plessy.org
Tsurumi, Kanagawa, Japan

From pmr at ebi.ac.uk  Thu Sep 17 03:24:07 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 08:24:07 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090917063659.GA27021@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>	<320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>	<20090916065755.GA15425@kunpuu.plessy.org>	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
Message-ID: <4AB1E417.3010405@ebi.ac.uk>

Charles Plessy wrote:
> I would also like it if the qualities were kept by default. I actually had tried
> to force the fastq-sanger format before, but by adding its name to the USAs,
> as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
> not work; I do not know if it is by design or because of the dash in the format
> name. Nevertheless -sformat=fastq-sanger and -osformat=fastq-sanger worked very
> well after I applied Mahmut's patch.

Yes, the dash in the format name is causing problems. It should be 
allowed where there is a '::' in the USA (it is not allowed in database 
queries because of the dbname-field:value query syntax).

I will make a patch for this.

> I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
> premature. In particular, I get the following warning each time the quality
> is encoded by an equal sign:
> 
>   Warning: Illegal character '='
>   Warning: Illegal pattern: =

This is surprising. Is your EMBOSS version the original distribution or
have you applied the current patches?

If it fails with the patched version, could you send me an input file 
that causes this error.

> By the way, I think I found a bug in revseq: it seems that it does not reverse
> the qualities:

True ... this I will also patch. We have used qualities for some years
(in Staden experiment format) but it appears nobody has reversed
sequences and kept the qualities. Life is changing with FASTQ data!

> Also, contrary to what the documentation predicts, using the fastq format
> for the output does not ignore the quality scores. (Not that it would be
> particularly useful, but...)

This is deliberate. We have to write something in FASTQ format and we
default to the fastq-sanger format. On input, fastq ignores
qualities because there is no safe way to decide which format is correct.

regards,

Peter

From biopython at maubp.freeserve.co.uk  Thu Sep 17 05:11:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:11:15 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
Message-ID: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>

On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>> Also, contrary to what the documentation predicts, using the fastq
>> format for the output does not ignore the quality scores. (Not that
>> it would be particularly useful, but...)
>
> This is deliberate. We have to write something in FASTQ format and we
> default to the fastq-sanger format. On input, fastq ignores qualities
> because there is no safe way to decide which format is correct.

So again, could you reconsider making "fastq" act like "fastq-sanger"?
The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
a superset of the Solexa/Illumina FASTQ variants - so even if you don't
know which kind of FASTQ file you have, and you don't care about the
qualities, parsing it as a Sanger FASTQ file will work.

Peter C.


From pmr at ebi.ac.uk  Thu Sep 17 05:18:59 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:18:59 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>	
	<20090916065755.GA15425@kunpuu.plessy.org>	
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>	
	<20090917063659.GA27021@kunpuu.plessy.org>	
	<4AB1E417.3010405@ebi.ac.uk>
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
Message-ID: <4AB1FF03.80705@ebi.ac.uk>

Peter C. wrote:
> On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> Also, contrary to what the documentation predicts, using the fastq
>>> format for the output does not ignore the quality scores. (Not that
>>> it would be particularly useful, but...)
>> This is deliberate. We have to write something in FASTQ format and we
>> default to the fastq-sanger format. On input, fastq ignores qualities
>> because there is no safe way to decide which format is correct.
> 
> So again, could you reconsider making "fastq" act like "fastq-sanger"?
> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
> know which kind of FASTQ file you have, and you don't care about the
> qualities, parsing it as a Sanger FASTQ file will work.

Yes, but it is dangerous if they could really be Solexa qualities.

What we could do is provide a utility that reads in fastq-sanger format 
and checks whether the quality scores make most sense as Sanger, Solexa
or Illumina.

I consider reading as fastq-sanger by default to be rather dangerous.

Peter

From biopython at maubp.freeserve.co.uk  Thu Sep 17 05:32:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:32:21 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1FF03.80705@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
	<4AB1FF03.80705@ebi.ac.uk>
Message-ID: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file.

So if the EMBOSS tools are used to read a FASTQ file without specifying
the FASTQ variant, do they currently detect that it is FASTQ and default to the
"fastq" setting and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up
a histogram of the ASCII characters used. This could immediately
rule out some of the options, and then based on the distribution (if
you assume they are raw reads) you could make a good guess.

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output
then? That might prevent some of the confusion at the moment. I'm
struggling to see any purpose for the current "fastq" output - can you
give me any example use case? Right now it has to pick an arbitrary
quality symbol, and uses ASCII 34 (double quote) which means PHRED
1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
Illumina 1.3+ FASTQ file.
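
The point about ASCII 34 can be checked with a small Python sketch. It assumes the usual encodings: Sanger uses an offset of 33 with PHRED scores from 0, Solexa uses an offset of 64 with scores from -5, and Illumina 1.3+ uses an offset of 64 with PHRED scores from 0. The names `fastq-solexa` and `fastq-illumina` follow the Biopython naming and are an assumption here; `decode_quality` is a hypothetical helper, not an EMBOSS function.

```python
# variant name -> (ASCII offset, lowest valid score)
VARIANTS = {
    "fastq-sanger": (33, 0),
    "fastq-solexa": (64, -5),
    "fastq-illumina": (64, 0),
}


def decode_quality(char, variant):
    """Convert one quality character to its score for the given variant."""
    offset, minimum = VARIANTS[variant]
    code = ord(char)
    score = code - offset
    # ASCII 126 is the highest printable character any variant allows.
    if score < minimum or code > 126:
        raise ValueError(f"{char!r} is not valid in {variant}")
    return score
```

Here `decode_quality('"', "fastq-sanger")` returns 1, while the same character raises `ValueError` for both of the offset-64 variants.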

Regards,

Peter

From pmr at ebi.ac.uk  Thu Sep 17 05:52:48 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:52:48 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>	
	<20090916065755.GA15425@kunpuu.plessy.org>	
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>	
	<20090917063659.GA27021@kunpuu.plessy.org>	
	<4AB1E417.3010405@ebi.ac.uk>	
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>	
	<4AB1FF03.80705@ebi.ac.uk>
	<320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
Message-ID: <4AB206F0.1040205@ebi.ac.uk>

Peter C. wrote:
> So if the EMBOSS tools are used to read a FASTQ file without specifying
> the FASTQ variant, do they currently detect that it is FASTQ and default to the
> "fastq" setting and ignore the quality information?

Yes, exactly so.

Reading the sequence data is safe, and may be all the user wanted to do.

>> What we could do is provide a utility that reads in fastq-sanger format and
>> checks whether the quality scores make most sense as Sanger, Solexa or
>> Illumina.
> 
> That could be useful - I guess you could scan all the reads building up
> a histogram of the ASCII characters used. This could immediately
> rule out some of the options, and then based on the distribution (if
> you assume they are raw reads) you could make a good guess.

The ACD file would be 'interesting'. We could set the default format to
be "fastq-sanger" and issue some warning if we find the user had tried 
to change it. That way the application would run with a filename as the 
input, though it will appear to interfaces to be able to read any 
sequence input.

Are there rules we can use to decide on improbable qualities? Values
below the Illumina and Solexa minima would seem a good guide, and
perhaps values above the likely short read maximum score.

Maybe some existing pipelines have some cutoff values we could adopt?

>> I consider reading as fastq-sanger by default to be rather dangerous.
> 
> That is understandable. How about removing the current "fastq" output
> then? That might prevent some of the confusion at the moment. I'm
> struggling to see any purpose for the current "fastq" output - can you
> give me any example use case? Right now it has to pick an arbitrary
> quality symbol, and uses ASCII 34 (double quote) which means PHRED
> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
> Illumina 1.3+ FASTQ file.

It is an alias for fastq-sanger which should be OK. I prefer to have an 
output format name for each input format name where it looks sensible, 
so if we read "fastq" as an input format it should do something on 
output. Unfortunately that means it has to write quality scores somehow.

regards,

Peter

From biopython at maubp.freeserve.co.uk  Thu Sep 17 06:06:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 11:06:16 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB206F0.1040205@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
	<4AB1FF03.80705@ebi.ac.uk>
	<320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
	<4AB206F0.1040205@ebi.ac.uk>
Message-ID: <320fb6e00909170306s328294e9h72717c329baba057@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>>> What we could do is provide a utility that reads in fastq-sanger format
>>> and checks whether the quality scores make most sense as Sanger,
>>> Solexa or Illumina.
>>
>> That could be useful - I guess you could scan all the reads building up
>> a histogram of the ASCII characters used. This could immediately
>> rule out some of the options, and then based on the distribution (if
>> you assume they are raw reads) you could make a good guess.
>
> The ACD file would be 'interesting'. We could set the default format to be
> "fastq-sanger" and issue some warning if we find the user had tried to
> change it. That way the application would run with a filename as the input,
> though it will appear to interfaces to be able to read any sequence input.
>
> Are there rules we can use to decide on improbable qualities? Values below
> the Illumina and Solexa minima would seem a good guide, and perhaps
> values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina
reads should be easy. However, unless there are some ASCII characters
in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe way
to tell Solexa and Illumina 1.3+ apart. Of course, if they just have good
reads above Solexa/PHRED 10 (which would be ASCII 74), either way
it isn't going to make much difference. In any case, it will be heuristic,
and sometimes it will get it wrong (e.g. post processed Sanger FASTQ
files with high scores might look like raw reads in Solexa/Illumina
FASTQ).
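
The range-based part of this heuristic can be sketched in Python as follows. It builds the histogram of quality characters and keeps only the variants whose valid range covers every character seen: ASCII 33 and above for Sanger, 59 and above for Solexa, 64 and above for Illumina 1.3+. It deliberately stops there and does not attempt the distribution-based guess, so a file that uses only high characters remains ambiguous. The function name `candidate_fastq_variants` and the Biopython-style variant names are assumptions for illustration.

```python
from collections import Counter


def candidate_fastq_variants(quality_strings):
    """Return (candidates, histogram) for an iterable of quality strings.

    candidates lists every variant whose character range covers the data;
    histogram counts how often each quality character occurs.
    """
    histogram = Counter()
    for quality in quality_strings:
        histogram.update(quality)
    if not histogram:
        raise ValueError("no quality characters seen")
    lowest = min(ord(char) for char in histogram)
    highest = max(ord(char) for char in histogram)
    if lowest < 33 or highest > 126:
        raise ValueError("character outside the FASTQ range 33 to 126")
    # Sanger accepts the whole range, so it can never be ruled out here.
    candidates = ["fastq-sanger"]
    if lowest >= 59:
        candidates.append("fastq-solexa")
    if lowest >= 64:
        candidates.append("fastq-illumina")
    return candidates, histogram
```

For the quality string `12/3` only `fastq-sanger` survives; a string containing `;` (ASCII 59) but nothing lower leaves Sanger and Solexa; a string of `h` characters leaves all three, which is the ambiguous case described above.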

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote
quality string) is entirely "sensible". But I'll drop this now - I've argued my
case, and will leave it at that. As long as the current behaviour is clear
in the documentation, it should be OK.

Regards,

Peter

From charles-listes-emboss at plessy.org  Fri Sep 18 09:09:46 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Fri, 18 Sep 2009 22:09:46 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org>
	<320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
Message-ID: <20090918130946.GC16344@kunpuu.plessy.org>

On Thu, Sep 17, 2009 at 08:24:07AM +0100, Peter Rice wrote:
> Charles Plessy wrote:
>
>> I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
>> premature. In particular, I get the following warning each time the quality
>> is encoded by an equal sign:
>>
>>   Warning: Illegal character '='
>>   Warning: Illegal pattern: =
>
> This is surprising. Is your EMBOSS version the original distribution or
> have you applied the current patches?

Actually, I worked on other files today and could not reproduce the result. I
will file a proper SourceForge bug if it ever comes back...

Have a nice week-end,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan

From belegdol at gmail.com  Tue Sep 29 15:34:56 2009
From: belegdol at gmail.com (Julian Sikorski)
Date: Tue, 29 Sep 2009 21:34:56 +0200
Subject: [EMBOSS] Packaging EMBOSS for Fedora
In-Reply-To: <h4pdaa$ad5$1@ger.gmane.org>
References: <h0uef8$bdu$1@ger.gmane.org>	<1244850556.8999.7.camel@login-svr1.ebi.ac.uk>	<h3v43r$tc5$1@ger.gmane.org>	<49993.78.105.201.225.1248020309.squirrel@webmail.ebi.ac.uk>
	<h4pdaa$ad5$1@ger.gmane.org>
Message-ID: <h9tnh1$jfv$1@ger.gmane.org>

On 29.07.2009 13:54, Julian Sikorski wrote:
> On 19.07.2009 18:18, uludag at ebi.ac.uk wrote:
>>
>>> there seem to be some problems with make install:
>>>
>>> /usr/bin/make  install-exec-hook
>>> make[7]: Entering directory
>>> `/builddir/build/BUILD/EMBOSS-6.1.0/jemboss/org/emboss/jemboss/editor'
>>> mkdir -p --
>>> /builddir/build/BUILDROOT/EMBOSS-6.1.0-1.fc11.x86_64/usr/share/EMBOSS/jemboss/org/emboss/jemboss/editor
>>> /usr/bin/install: cannot stat `*.class': No such file or directory
>>
>> Looks like we didn't test the --with-java and --with-javaos configure
>> options well, before this release. However, most users will not need these
>> two options any more as EMBOSS-6.1.0 includes precompiled jemboss class
>> files collected in a java archive file. You should hopefully not get the
>> above error if you omit these two options when you configure your emboss
>> installation.
>>
>> Regards,
>> Mahmut
> Thank you, removing these two seems to have done the trick!
> 
> Julian
Another problem arose. It was pointed out to me that the following warning
from rpmlint should be fixed:

EMBOSS-libs.x86_64: W: shared-lib-calls-exit
/usr/lib64/libeplplot.so.3.2.7 exit@

According to the tool, it means the following:

This library package calls exit() or _exit(), probably in a non-fork()
context. Doing so from a library is strongly discouraged - when a
library function calls exit(), it prevents the calling program from
handling the error, reporting it to the user, closing files properly,
and cleaning up any state that the program has. It is preferred for the
library to return an actual error code and let the calling program
decide how to handle the situation.

Also, the guidelines [1] say that the .jar files should be compiled from
source, so fixing the problem with --with-java would definitely help here.

Julian

[1]
https://fedoraproject.org/wiki/Packaging:Java#Pre-built_JAR_files_.2F_Other_bundled_software


From stephen.taylor at imm.ox.ac.uk  Thu Sep 10 13:37:18 2009
From: stephen.taylor at imm.ox.ac.uk (Stephen Taylor)
Date: Thu, 10 Sep 2009 14:37:18 +0100
Subject: [EMBOSS] PWMs in EMBOSS
Message-ID: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk>

Hi,

Is it possible to search PWMs in the following format in EMBOSS?

Thanks,

Steve


Name

A:	0.157272762655186	0.101522869320827	0.193269908676739	 
0.400388333956085	0.0242258313143186	0.00799984439409978	 
0.985140882956166	0.982735821742488	0.00822519796024231	 
0.00216973239819103	0.915564050240553	0.620804636015513	 
0.170770180463701	0.288450091623397	0.357825784599729	 
0.234712937095288	0.12172997792596
C:	0.585688175787352	0.237479093836628	0.435987353829905	 
0.147300012153726	0.342087245546403	0.0758172877627498	 
0.00580600999936562	0.00109267794415073	0.00794630235311888	 
0.00688337464627776	0.000618817602597609	0.0128822871237646	 
0.230733571623305	0.276650970023954	0.233013950290859	 
0.285797488742033	0.426544489294796
G:	0.146549771679813	0.564379933620146	0.162348523327608	 
0.288303072582316	0.0128822871237646	0.000618817602597609	 
0.00688337464627776	0.00794630235311888	0.00109267794415073	 
0.00580600999936562	0.0758172877627498	0.342087245546403	 
0.181573332311972	0.131046633924716	0.187725054521253	 
0.247196691859296	0.213778652608617
T:	0.110489289877649	0.0966181032223992	0.208394214165748	 
0.164008581307874	0.620804636015513	0.915564050240553	 
0.00216973239819103	0.00822519796024231	0.982735821742488	 
0.985140882956166	0.00799984439409978	0.0242258313143186	 
0.416922915601023	0.303852304427932	0.221435210588158	 
0.232292882303383	0.237946880170627


From pmr at ebi.ac.uk  Thu Sep 10 14:54:25 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 10 Sep 2009 15:54:25 +0100
Subject: [EMBOSS] PWMs in EMBOSS
In-Reply-To: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk>
References: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk>
Message-ID: <4AA91321.1020002@ebi.ac.uk>

Stephen Taylor wrote:
> Hi,
> 
> Is it possible to search PWMs in the following format in EMBOSS?

Not yet ... but we can extend the formats for PWMs. We only support the
formats that we write.

Votes please on position weight matrix formats EMBOSS should be able to
read ...

regards,

Peter


From charles-listes-emboss at plessy.org  Wed Sep 16 06:57:55 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Wed, 16 Sep 2009 15:57:55 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
Message-ID: <20090916065755.GA15425@kunpuu.plessy.org>

Dear EMBOSS developers,

I have multi-sequence file in FASTQ format that contains sequencing reads, and
would like to retreive them the with seqret. But as you see in the following
example, quality scores are not preserved:

$ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
Reads and writes (returns) sequences
@F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
+
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

The purpose was to use seqret as a workaround for the fact that vectorstrip
does not keep the quality either.

Do you think that it would be possible to get this functionality as a patch in
the future, or is it big work that needs to wait for the next release?

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


From uludag at ebi.ac.uk  Wed Sep 16 09:12:16 2009
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Wed, 16 Sep 2009 10:12:16 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
Message-ID: <1253092337.32439.31.camel@emboss2.ebi.ac.uk>

Hi Charles,

seqret returns quality scores if the input sequence format is explicitly
defined on the command line, such as -sformat=fastq-sanger.

The following patch looks like fixes the vectorstrip problem.


*** ajseq.c.org	2009-09-16 10:08:17.000000000 +0100
--- ajseq.c	2009-09-16 09:52:56.000000000 +0100
***************
*** 781,786 ****
--- 781,792 ----
  
      if (seq->Fttable)
  	pthis->Fttable = ajFeattableCopy(seq->Fttable);
+     
+     if (seq->Accuracy)
+     {
+     	AJCNEW0(pthis->Accuracy,seq->Seq->Len);
+     	memmove(pthis->Accuracy,seq->Accuracy,seq->Seq->Len*sizeof(float));
+     }
  
      return pthis;
  }


Regards,
Mahmut


> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you can see in the following
> example, quality scores are not preserved:
> 
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
> 
> The purpose was to use seqret as a workaround for the fact that vectorstrip
> does not keep the quality either.
> 
> Do you think that it would be possible to get this functionality as a patch in
> the future, or is it a big job that needs to wait for the next release?


From biopython at maubp.freeserve.co.uk  Wed Sep 16 09:31:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 10:31:22 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
Message-ID: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>

On Wed, Sep 16, 2009 at 7:57 AM, Charles Plessy
<charles-listes-emboss at plessy.org> wrote:
>
> Dear EMBOSS developers,
>
> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you can see in the following
> example, quality scores are not preserved:
>
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

You need to use "fastq-sanger" (or the other variants), since in
EMBOSS, "fastq" currently means FASTQ ignoring the qualities.
This is documented:

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

As an EMBOSS user, I think the current situation is confusing, and
it would make much more sense to have "fastq" just an alias for
"fastq-sanger" (which would be consistent with Biopython and BioPerl).

http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html

And also this email - especially the last example:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html

> The purpose was to use seqret as a workaround for the fact that
> vectorstrip does not keep the quality either.

That's also been suggested, and is likely to be supported in future.
http://lists.open-bio.org/pipermail/emboss/2009-August/003722.html

Peter


From charles-listes-emboss at plessy.org  Thu Sep 17 06:36:59 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Thu, 17 Sep 2009 15:36:59 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org>
	<320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
Message-ID: <20090917063659.GA27021@kunpuu.plessy.org>

On Wed, Sep 16, 2009 at 10:12:16AM +0100, Mahmut Uludag wrote:
> 
> seqret returns quality scores if the input sequence format is explicitly
> defined on the command line, such as -sformat=fastq-sanger.
> 
> The following patch appears to fix the vectorstrip problem.


On Wed, Sep 16, 2009 at 10:31:22AM +0100, Peter wrote:
> 
> You need to use "fastq-sanger" (or the other variants), since in
> EMBOSS, "fastq" currently means FASTQ ignoring the qualities.


Hi Mahmut and Peter, and thank you very much for your answers!

I would also like it if the qualities were kept by default. I had actually tried
to force the fastq-sanger format before, but by adding its name to the USAs,
as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
not work; I do not know if that is by design or because of the dash in the format
name. Nevertheless, -sformat=fastq-sanger and -osformat=fastq-sanger worked very
well after I applied Mahmut's patch.

I am tempted to apply it to the Debian EMBOSS package as well, but maybe that is
premature. In particular, I get the following warning each time the quality
is encoded by an equals sign:

  Warning: Illegal character '='
  Warning: Illegal pattern: =

By the way, I think I found a bug in revseq: it seems that it does not reverse
the qualities:

  $ echo -e "@toto\nACTG\n+toto\n12/3" | seqret -filter -sformat=fastq-sanger -osformat=fastq-sanger
  @toto
  ACTG
  +
  12/3
  
  $ echo -e "@toto\nACTG\n+toto\n12/3" | revseq -filter -sformat=fastq-sanger -osformat=fastq-sanger
  @toto Reversed:
  CAGT
  +
  12/3

Also, contrary to what the documentation says, using the fastq format
for the output does not ignore the quality scores. (Not that it would be
particularly useful, but...)

  $ echo -e "@toto\nACTG\n+toto\nACTG" | revseq -filter -sformat=fastq-sanger -osformat=fastq
  @toto Reversed:
  CAGT
  +
  ACTG


Have a nice day,

-- 
Charles Plessy
http://charles.plessy.org
Tsurumi, Kanagawa, Japan


From pmr at ebi.ac.uk  Thu Sep 17 07:24:07 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 08:24:07 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090917063659.GA27021@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>	<320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>	<20090916065755.GA15425@kunpuu.plessy.org>	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
Message-ID: <4AB1E417.3010405@ebi.ac.uk>

Charles Plessy wrote:
> I would also like it if the qualities were kept by default. I had actually tried
> to force the fastq-sanger format before, but by adding its name to the USAs,
> as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
> not work; I do not know if that is by design or because of the dash in the format
> name. Nevertheless, -sformat=fastq-sanger and -osformat=fastq-sanger worked very
> well after I applied Mahmut's patch.

Yes, the dash in the format name is causing problems. It should be 
allowed where there is a '::' in the USA (it is not allowed in database 
queries because of the dbname-field:value query syntax).

I will make a patch for this.

> I am tempted to apply it to the Debian EMBOSS package as well, but maybe that is
> premature. In particular, I get the following warning each time the quality
> is encoded by an equals sign:
> 
>   Warning: Illegal character '='
>   Warning: Illegal pattern: =

This is surprising. Is your EMBOSS version the original distribution or
have you applied the current patches?

If it fails with the patched version, could you send me an input file
that causes this error?

> By the way, I think I found a bug in revseq: it seems that it does not reverse
> the qualities:

True ... this I will also patch. We have used qualities for some years
(in Staden experiment format) but it appears nobody has reversed
sequences and kept the qualities. Life is changing with FASTQ data!
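
To illustrate the expected behaviour, here is a minimal Python sketch (not the
EMBOSS implementation; the helper name is hypothetical). When a read is
reverse-complemented, the quality string has to be reversed as well, so that
each score stays attached to the base it was measured for.

```python
# Hypothetical helper, not part of EMBOSS: reverse-complement a read and
# reverse its quality string so every score stays with its base.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_with_quality(seq, qual):
    """Return (reverse-complemented sequence, reversed quality string)."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Complement each base, then reverse; the qualities are only reversed.
    return seq.translate(_COMPLEMENT)[::-1], qual[::-1]


# Same record as in the revseq example earlier in the thread.
print(reverse_complement_with_quality("ACTG", "12/3"))
```

With the record from the revseq example, the sequence becomes CAGT and the
quality line becomes 3/21 rather than staying 12/3.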

> Also, contrary to what the documentation says, using the fastq format
> for the output does not ignore the quality scores. (Not that it would be
> particularly useful, but...)

This is deliberate. We have to write something in FASTQ format and we
default to the fastq-sanger format. On input, fastq ignores
qualities because there is no safe way to decide which format is correct.

regards,

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 09:11:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:11:15 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
Message-ID: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>

On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>> Also, contrary to what the documentation says, using the fastq
>> format for the output does not ignore the quality scores. (Not that
>> it would be particularly useful, but...)
>
> This is deliberate. We have to write something in FASTQ format and we
> default to the fastq-sanger format. On input, fastq ignores qualities
> because there is no safe way to decide which format is correct.

So again, could you reconsider making "fastq" act like "fastq-sanger"?
The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
a superset of the Solexa/Illumina FASTQ variants - so even if you don't
know which kind of FASTQ file you have, and you don't care about the
qualities, parsing it as a Sanger FASTQ file will work.

Peter C.


From pmr at ebi.ac.uk  Thu Sep 17 09:18:59 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:18:59 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>	
	<20090916065755.GA15425@kunpuu.plessy.org>	
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>	
	<20090917063659.GA27021@kunpuu.plessy.org>	
	<4AB1E417.3010405@ebi.ac.uk>
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
Message-ID: <4AB1FF03.80705@ebi.ac.uk>

Peter C. wrote:
> On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> Also, contrary to what the documentation says, using the fastq
>>> format for the output does not ignore the quality scores. (Not that
>>> it would be particularly useful, but...)
>> This is deliberate. We have to write something in FASTQ format and we
>> default to the fastq-sanger format. On input, fastq ignores qualities
>> because there is no safe way to decide which format is correct.
> 
> So again, could you reconsider making "fastq" act like "fastq-sanger"?
> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
> know which kind of FASTQ file you have, and you don't care about the
> qualities, parsing it as a Sanger FASTQ file will work.

Yes, but it is dangerous if they could really be Solexa qualities.

What we could do is provide a utility that reads in fastq-sanger format 
and checks whether the quality scores make most sense as Sanger, Solexa
or Illumina.

I consider reading as fastq-sanger by default to be rather dangerous.

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 09:32:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:32:21 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1FF03.80705@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
	<4AB1FF03.80705@ebi.ac.uk>
Message-ID: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file.

So if the EMBOSS tools are used to read a FASTQ file without specifying
the FASTQ variant, do they currently detect it is FASTQ and default to the
"fastq" setting and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up
a histogram of the ASCII characters used. This could immediately
rule out some of the options, and then based on the distribution (if
you assume they are raw reads) you could make a good guess.

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output
then? That might prevent some of the confusion at the moment. I'm
struggling to see any purpose for the current "fastq" output - can you
give me any example use case? Right now it has to pick an arbitrary
quality symbol, and uses ASCII 34 (double quote) which means PHRED
1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
Illumina 1.3+ FASTQ file.
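
As a rough illustration of that point, the Python sketch below decodes one
quality character under each variant. The offsets and minimum ASCII codes are
my assumptions from the usual descriptions of the three encodings (Sanger:
offset 33, ASCII 33 to 126; Solexa: offset 64, ASCII 59 to 126; Illumina 1.3+:
offset 64, ASCII 64 to 126), and the function name and dictionary keys are
only labels chosen for this example.

```python
# Assumed (offset, minimum valid ASCII code) for each FASTQ variant.
# The dictionary keys are labels for this sketch only.
VARIANTS = {
    "fastq-sanger": (33, 33),    # PHRED 0 starts at ASCII 33
    "fastq-solexa": (64, 59),    # Solexa scores start at -5 (ASCII 59)
    "fastq-illumina": (64, 64),  # PHRED 0 starts at ASCII 64
}
MAX_ASCII = 126


def decode_quality_char(char, variant):
    """Return the score encoded by char, or None if it is out of range."""
    offset, min_ascii = VARIANTS[variant]
    code = ord(char)
    if code < min_ascii or code > MAX_ASCII:
        return None  # not a legal quality character in this variant
    return code - offset


# The double quote written by the current "fastq" output is ASCII 34.
for name in VARIANTS:
    print(name, decode_quality_char('"', name))
```

Under these assumptions the double quote decodes to 1 as Sanger and is
rejected (None) by the other two variants.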

Regards,

Peter


From pmr at ebi.ac.uk  Thu Sep 17 09:52:48 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:52:48 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>	
	<20090916065755.GA15425@kunpuu.plessy.org>	
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>	
	<20090917063659.GA27021@kunpuu.plessy.org>	
	<4AB1E417.3010405@ebi.ac.uk>	
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>	
	<4AB1FF03.80705@ebi.ac.uk>
	<320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
Message-ID: <4AB206F0.1040205@ebi.ac.uk>

Peter C. wrote:
> So if the EMBOSS tools are used to read a FASTQ file without specifying
> the FASTQ variant, do they currently detect it is FASTQ and default to the
> "fastq" setting and ignore the quality information?

Yes, exactly so.

Reading the sequence data is safe, and may be all the user wanted to do.

>> What we could do is provide a utility that reads in fastq-sanger format and
>> checks whether the quality scores make most sense as Sanger, Solexa or
>> Illumina.
> 
> That could be useful - I guess you could scan all the reads building up
> a histogram of the ASCII characters used. This could immediately
> rule out some of the options, and then based on the distribution (if
> you assume they are raw reads) you could make a good guess.

The ACD file would be 'interesting'. We could set the default format to
be "fastq-sanger" and issue some warning if we find the user had tried 
to change it. That way the application would run with a filename as the 
input, though it will appear to interfaces to be able to read any 
sequence input.

Are there rules we can use to decide on improbable qualities? Values
below the Illumina and Solexa minima would seem a good guide, and 
perhaps values above the likely short read maximum score.

Maybe some existing pipelines have some cutoff values we could adopt?

>> I consider reading as fastq-sanger by default to be rather dangerous.
> 
> That is understandable. How about removing the current "fastq" output
> then? That might prevent some of the confusion at the moment. I'm
> struggling to see any purpose for the current "fastq" output - can you
> give me any example use case? Right now it has to pick an arbitrary
> quality symbol, and uses ASCII 34 (double quote) which means PHRED
> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
> Illumina 1.3+ FASTQ file.

It is an alias for fastq-sanger which should be OK. I prefer to have an 
output format name for each input format name where it looks sensible, 
so if we read "fastq" as an input format it should do something on 
output. Unfortunately that means it has to write quality scores somehow.

regards,

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 10:06:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 11:06:16 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB206F0.1040205@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
	<320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
	<4AB1FF03.80705@ebi.ac.uk>
	<320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
	<4AB206F0.1040205@ebi.ac.uk>
Message-ID: <320fb6e00909170306s328294e9h72717c329baba057@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
>>> What we could do is provide a utility that reads in fastq-sanger format
>>> and checks whether the quality scores make most sense as Sanger,
>>> Solexa or Illumina.
>>
>> That could be useful - I guess you could scan all the reads building up
>> a histogram of the ASCII characters used. This could immediately
>> rule out some of the options, and then based on the distribution (if
>> you assume they are raw reads) you could make a good guess.
>
> The ACD file would be 'interesting'. We could set the default format to be
> "fastq-sanger" and issue some warning if we find the user had tried to
> change it. That way the application would run with a filename as the input,
> though it will appear to interfaces to be able to read any sequence input.
>
> Are there rules we can use to decide on improbable qualities? Values below
> the Illumina and Solexa minima would seem a good guide, and perhaps
> values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina
reads should be easy. However, unless there are some ASCII characters
in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe way
to tell Solexa and Illumina 1.3+ apart. Of course, if they just have good
reads above Solexa/PHRED 10 (which would be ASCII 74), either way
it isn't going to make much difference. In any case, it will be heuristic,
and sometimes it will get it wrong (e.g. post processed Sanger FASTQ
files with high scores might look like raw reads in Solexa/Illumina
FASTQ).
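
The "rule out" step could look something like the Python sketch below. It is
only a sketch under the same assumed ASCII ranges (Sanger 33 to 126, Solexa 59
to 126, Illumina 1.3+ 64 to 126); the function name and the variant labels are
hypothetical, and it deliberately does not attempt the distribution-based
guess, which would stay heuristic.

```python
# Assumed minimum ASCII code per variant; the maximum is 126 for all three.
# The labels are illustrative names for this sketch.
MIN_ASCII = {"fastq-sanger": 33, "fastq-solexa": 59, "fastq-illumina": 64}
MAX_ASCII = 126


def possible_fastq_variants(quality_strings):
    """Return the variants whose ASCII range covers every quality character."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        return list(MIN_ASCII)  # nothing seen, so nothing can be ruled out
    lowest, highest = min(codes), max(codes)
    if highest > MAX_ASCII:
        return []  # not a valid quality string in any variant
    return [name for name, minimum in MIN_ASCII.items() if lowest >= minimum]


print(possible_fastq_variants(["12/3"]))  # characters below ASCII 59
print(possible_fastq_variants([";<hh"]))  # characters in the range 59 to 63
print(possible_fastq_variants(["hhhh"]))  # only characters from ASCII 64 up
```

Sanger always remains a candidate because its range contains the other two,
which is why a second step based on the score distribution would still be
needed to pick one.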

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote
quality string) is entirely "sensible". But I'll drop this now - I've argued my
case, and will leave it at that. As long as the current behaviour is clear
in the documentation, it should be OK.

Regards,

Peter


From charles-listes-emboss at plessy.org  Fri Sep 18 13:09:46 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Fri, 18 Sep 2009 22:09:46 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org>
	<320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
	<20090916065755.GA15425@kunpuu.plessy.org>
	<1253092337.32439.31.camel@emboss2.ebi.ac.uk>
	<20090917063659.GA27021@kunpuu.plessy.org>
	<4AB1E417.3010405@ebi.ac.uk>
Message-ID: <20090918130946.GC16344@kunpuu.plessy.org>

On Thu, Sep 17, 2009 at 08:24:07AM +0100, Peter Rice wrote:
> Charles Plessy wrote:
>
>> I am tempted to apply it to the Debian EMBOSS package as well, but maybe that is
>> premature. In particular, I get the following warning each time the quality
>> is encoded by an equals sign:
>>
>>   Warning: Illegal character '='
>>   Warning: Illegal pattern: =
>
> This is surprising. Is your EMBOSS version the original distribution or
> have you applied the current patches?

Actually, I worked on other files today and could not reproduce the result. I
will file a proper SourceForge bug report if it ever comes back.

Have a nice week-end,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


From belegdol at gmail.com  Tue Sep 29 19:34:56 2009
From: belegdol at gmail.com (Julian Sikorski)
Date: Tue, 29 Sep 2009 21:34:56 +0200
Subject: [EMBOSS] Packaging EMBOSS for Fedora
In-Reply-To: <h4pdaa$ad5$1@ger.gmane.org>
References: <h0uef8$bdu$1@ger.gmane.org>	<1244850556.8999.7.camel@login-svr1.ebi.ac.uk>	<h3v43r$tc5$1@ger.gmane.org>	<49993.78.105.201.225.1248020309.squirrel@webmail.ebi.ac.uk>
	<h4pdaa$ad5$1@ger.gmane.org>
Message-ID: <h9tnh1$jfv$1@ger.gmane.org>

On 29.07.2009 13:54, Julian Sikorski wrote:
> On 19.07.2009 18:18, uludag at ebi.ac.uk wrote:
>>
>>> there seem to be some problems with make install:
>>>
>>> /usr/bin/make  install-exec-hook
>>> make[7]: Entering directory
>>> `/builddir/build/BUILD/EMBOSS-6.1.0/jemboss/org/emboss/jemboss/editor'
>>> mkdir -p --
>>> /builddir/build/BUILDROOT/EMBOSS-6.1.0-1.fc11.x86_64/usr/share/EMBOSS/jemboss/org/emboss/jemboss/editor
>>> /usr/bin/install: cannot stat `*.class': No such file or directory
>>
>> Looks like we didn't test the --with-java and --with-javaos configure
>> options well, before this release. However, most users will not need these
>> two options any more as EMBOSS-6.1.0 includes precompiled jemboss class
>> files collected in a java archive file. You should hopefully not get the
>> above error if you omit these two options when you configure your emboss
>> installation.
>>
>> Regards,
>> Mahmut
> Thank you, removing these two seems to have done the trick!
> 
> Julian
Another problem arose. It was pointed out to me that another issue reported
by rpmlint should be fixed:

EMBOSS-libs.x86_64: W: shared-lib-calls-exit
/usr/lib64/libeplplot.so.3.2.7 exit@

According to the tool, it means the following:

This library package calls exit() or _exit(), probably in a non-fork()
context. Doing so from a library is strongly discouraged - when a
library function calls exit(), it prevents the calling program from
handling the error, reporting it to the user, closing files properly,
and cleaning up any state that the program has. It is preferred for the
library to return an actual error code and let the calling program
decide how to handle the situation.

Also, the guidelines [1] say that the .jar files should be compiled from
source, so fixing the problem with --with-java would definitely help here.

Julian

[1]
https://fedoraproject.org/wiki/Packaging:Java#Pre-built_JAR_files_.2F_Other_bundled_software