[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Thu Jul 30 16:08:36 EDT 2009

On Jul 30, 2009, at 10:55 AM, Peter wrote:

> On Thu, Jul 30, 2009 at 4:46 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>> The EMBOSS patch I was testing from Peter Rice went for a
>>> silent truncation; in Biopython I have also for the moment gone
>>> for silently imposing the maximum scores (ASCII 126, 0x7e)
>>> of 93, 62 and 62 for the three formats. Another reason for this
>>> is speed.
>>>
>>> Peter
>>
>> Speed is one reason to worry, but we also should think carefully
>> about silently truncating the data w/o the user's knowledge.  One
>> thing we don't want to propagate is loss of data w/o warning.
>
> Yes and no. Do you warn about converting from EMBL/GenBank to
> FASTA? Or from a PFAM alignment to a ClustalW or PHYLIP
> alignment? In those cases, anyone familiar with the file formats will
> expect data loss as you are going from a richly annotated file format
> to something much simpler.

Right, but this doesn't follow along the same lines.  Going from an
annotation- and feature-rich format to a very lightweight format is
one thing.  This situation (at least to me) is more analogous to
excluding a subset of features b/c they don't fit certain parameters.

I do think that if it affects performance to a significant enough degree
we can do this silently; we just need to ensure it is well-documented.
My opinion is that this will prove to be an edge case anyway (most will
want conversion to Sanger vs. Illumina/Solexa).
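
To make the trade-off concrete, here is a rough Python sketch of capping
scores at ASCII 126 with an optional warning. The names (encode_qualities,
OFFSETS, the warn flag) are hypothetical and are not the Biopython or
BioPerl API. It assumes the usual ASCII offsets of 33 for Sanger and 64
for Solexa/Illumina, which give the maxima of 93, 62 and 62 mentioned
above. It only handles the upper cap and does no Solexa/PHRED score
conversion.

```python
import warnings

MAX_ASCII = 126  # '~', the highest printable ASCII character

# Assumed ASCII offsets for the three FASTQ variants.
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def encode_qualities(scores, variant="sanger", warn=True):
    """Encode integer scores as a FASTQ quality string, capping at ASCII 126.

    Hypothetical helper for illustration only.
    """
    offset = OFFSETS[variant]
    max_score = MAX_ASCII - offset  # 93 for sanger, 62 for solexa/illumina
    scores = list(scores)
    clipped = [min(q, max_score) for q in scores]
    if warn and clipped != scores:
        # One warning per record keeps the overhead low when nothing is capped.
        warnings.warn(
            "Quality scores above %d were capped for %s FASTQ output"
            % (max_score, variant)
        )
    return "".join(chr(q + offset) for q in clipped)
```

With warn=False this behaves like the silent truncation; the extra cost
of the warning path is one list comparison per record, which is the sort
of thing worth timing before deciding either way.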

> Likewise here, anyone familiar with the FASTQ variants (and our
> documentation should cover this) shouldn't be surprised at this
> quality truncation. But I must concede, this is a more subtle and
> less obvious data issue. So maybe you are right.
>
> I can take a look at this and see how badly it would impact the
> speed for Biopython...
>
> Peter

Will try to do the same for bioperl.

chris