[Bioperl-l] Next-gen modules

Tue Jun 23 20:34:48 UTC 2009

On Jun 23, 2009, at 7:22 AM, Peter Rice wrote:

> Peter wrote:
> ...
>>> Parsing was easy - find the '@' line, read sequence until the '+' line
>>> is reached, then read (seqlen) quality characters ... and check the next
>>> line starts with '@'
>>
>> That is basically what I did for Biopython.

This is now what BioPerl will do as well (at least once I commit my
changes, today or tomorrow).
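
For anyone following along, here is a rough sketch of the parsing approach
quoted above, written in Python rather than Perl. It is only an illustration
of the idea (find '@', read sequence up to '+', then read exactly seqlen
quality characters), not the actual EMBOSS, Biopython or BioPerl code. The
function name parse_fastq and the tuple it yields are made up for this
example.

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines.

    Hypothetical helper for illustration only; not a real library API.
    """
    it = iter(lines)
    line = next(it, None)
    while line is not None:
        line = line.rstrip("\r\n")
        if not line:
            # Skip blank lines between records.
            line = next(it, None)
            continue
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:]

        # Read (possibly wrapped) sequence lines until the '+' line.
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line.strip())
        else:
            raise ValueError("no '+' line found for record %r" % title)
        seq = "".join(seq_parts)

        # Read quality lines until we have len(seq) characters. Counting
        # characters matters because '@' is itself a valid quality character,
        # so a quality line may legitimately start with '@'.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("truncated quality for record %r" % title)
            line = line.rstrip("\r\n")
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError("quality/sequence length mismatch in %r" % title)

        yield title, seq, "".join(qual_parts)

        # The next non-blank line must start with '@'; that is checked at
        # the top of the loop.
        line = next(it, None)
```

Note that the first record below has a quality string beginning with '@',
which a naive "split on lines starting with @" parser would get wrong:

    list(parse_fastq(["@r1\n", "ACGT\n", "TT\n", "+\n", "@III@I\n"]))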

> ...
>>> We gave up on trying to guess the quality score standard and require
>>> users to say whether they are sanger, solexa (1.0) or Illumina (1.3)
>>> format files. If we only want the sequence then we don't care so we allow
>>> "fastq" as a sequence format and ignore the quality scores in that case.
>>
>> What format names have you used? Ideally we'd have the same names
>> in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and
>> "fastq-illumina").
>
> We don't normally use '-' in our format names so we have fastqsanger,
> fastqsolexa, fastqillumina and fastqint. None of these have been tried
> on users as yet.
>
> The '-' names look nice though. We can consider introducing them. Do you
> have a full list of format names (sequence, feature, alignment, etc.) we
> can try to conform to?

We (BioPerl) are using Biopython's "format-variant" naming convention
(e.g. "fastq-illumina"), or at least that's how I'm coding it up. With
SeqIO it's fairly easy to split the variant off the format name before
loading the parser class and pass it in as a second named parameter.
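
To make the idea concrete, here is a minimal sketch of that name handling,
in Python for illustration. The function split_format and its
default_variant argument are hypothetical; this is not the BioPerl SeqIO
code, just the general shape of "split on the first '-' and hand the two
pieces on separately".

```python
def split_format(name, default_variant=None):
    """Split 'format-variant' into (format, variant).

    Hypothetical helper: the format part would select the parser class,
    the variant part would be passed to it as a separate parameter.
    """
    fmt, sep, variant = name.lower().partition("-")
    return fmt, (variant if sep else default_variant)
```

So split_format("fastq-illumina") gives ("fastq", "illumina"), while a
plain "fastq" gives ("fastq", None) and leaves the choice of default
variant to the caller.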

I have actually thought of adding in fastqint as an option (it would  
be fairly easy to do).
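
As background on why the variant has to be stated at all: the three
encodings differ in ASCII offset and, for Solexa, in the score scale, so
the same quality character decodes to different values. The sketch below
uses the commonly documented conventions (Sanger: Phred + 33; Solexa 1.0:
Solexa score + 64; Illumina 1.3: Phred + 64), which are not spelled out in
this thread. The names OFFSETS and decode_quality are made up for the
example.

```python
import math

# ASCII offsets per variant (commonly documented conventions; an
# assumption here, not something stated in this thread).
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def decode_quality(qual, variant):
    """Return a list of Phred-scaled scores for a quality string."""
    offset = OFFSETS[variant]
    scores = [ord(c) - offset for c in qual]
    if variant == "solexa":
        # Solexa scores are on a different log scale; convert to Phred
        # with Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
        scores = [10 * math.log10(10 ** (q / 10.0) + 1) for q in scores]
    return scores
```

The character 'I' is Phred 40 under the Sanger encoding but only Phred 9
under Illumina 1.3, which is the kind of ambiguity that makes guessing the
standard unreliable.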

chris