[Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
Chris Fields
cjfields at illinois.edu
Thu Jul 30 20:08:36 UTC 2009
On Jul 30, 2009, at 10:55 AM, Peter wrote:
> On Thu, Jul 30, 2009 at 4:46 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>> The EMBOSS patch I was testing from Peter Rice went for a
>>> silent truncation; in Biopython I have also, for the moment, gone
>>> for silently imposing the maximum scores (ASCII 126, 0x7e)
>>> of 93, 62 and 62 for the three formats. Another reason for this
>>> is speed.
>>>
>>> Peter
>>
>> Speed is one reason to worry, but we also should think carefully about
>> silently truncating the data w/o the user's knowledge. One thing we
>> don't want to propagate is loss of data w/o warning.
>
> Yes and no. Do you warn about converting from EMBL/GenBank to
> FASTA? Or from a PFAM alignment to a ClustalW or PHYLIP
> alignment? In those cases, anyone familiar with the file formats will
> expect data loss as you are going from a richly annotated file format
> to something much simpler.
Right, but this doesn't follow along the same lines. Going from an
annotation- and feature-rich format to a very lightweight format is
one thing. This situation (at least to me) is more analogous to
excluding a subset of features b/c they don't fit certain parameters.
I do think that if warning affects performance to a significant enough
degree we can do this silently; we just need to ensure it is
well-documented. My opinion is that this use will prove to be an edge
case anyway (most will want conversion to Sanger vs. Illumina/Solexa).
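For reference, a rough sketch in Python of the arithmetic behind those
limits. This is not the Biopython, BioPerl, or EMBOSS code: the names
OFFSETS, max_score and encode_quality are made up for illustration, and
the offsets (33 for Sanger, 64 for Solexa and Illumina) are the commonly
documented ones. It ignores Solexa/PHRED score conversion and lower
bounds, and only shows the cap, with an optional warning in place of
silence.

```python
import warnings

# Highest printable ASCII character available for quality strings ('~', 0x7e).
MAX_ASCII = 126

# Assumed ASCII offsets for the three FASTQ variants (illustrative table).
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def max_score(variant):
    """Largest score that still maps to a printable ASCII character."""
    return MAX_ASCII - OFFSETS[variant]


def encode_quality(scores, variant, warn=False):
    """Encode integer scores as a quality string, capping at the maximum.

    With warn=False the cap is applied silently; with warn=True a single
    warning is issued if any score had to be lowered.
    """
    limit = max_score(variant)
    offset = OFFSETS[variant]
    if warn and any(q > limit for q in scores):
        warnings.warn(
            "quality scores above %d truncated for %s FASTQ" % (limit, variant)
        )
    return "".join(chr(min(q, limit) + offset) for q in scores)


print(max_score("sanger"), max_score("solexa"), max_score("illumina"))
print(encode_quality([40, 93, 94], "sanger"))
```

The extra any() pass in the warn=True branch is the kind of per-record
cost that would need timing before deciding which behaviour to make the
default.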
> Likewise here, anyone familiar with the FASTQ variants (and our
> documentation should cover this) shouldn't be surprised at this
> quality truncation. But I must concede, this is a more subtle and
> less obvious data issue. So maybe you are right.
>
> I can take a look at this and see how badly it would impact the
> speed for Biopython...
>
> Peter
Will try to do the same for BioPerl.
chris