[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Tue Jun 23 20:34:48 UTC 2009
On Jun 23, 2009, at 7:22 AM, Peter Rice wrote:
> Peter wrote:
> ...
>>> Parsing was easy - find the '@' line, read sequence until the '+' line
>>> is reached, then read (seqlen) quality characters ... and check the next
>>> line starts with '@'
>>
>> That is basically what I did for Biopython.

This is now what bioperl will do (at least when I commit changes today
or tomorrow).

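For anyone following along, here is a rough sketch of the approach described
above, written in Python since that is what the quoted Biopython comment
refers to. It is not the actual EMBOSS, Biopython or BioPerl code; the
function name parse_fastq and the error messages are made up for
illustration. The point it shows is that the quality block is read by
counting characters against the sequence length, because '@' and '+' are
themselves valid quality characters.

```python
import io


def parse_fastq(handle):
    """Yield (title, sequence, quality) tuples from a FASTQ text handle.

    Hypothetical helper: a sketch of the parsing strategy, not library code.
    """
    line = handle.readline()
    while line and not line.strip():        # skip leading blank lines
        line = handle.readline()
    while line:
        # Every record must begin with an '@' title line.
        if not line.startswith("@"):
            raise ValueError("expected '@' line, got %r" % line)
        title = line[1:].rstrip()

        # The sequence may be wrapped; it ends at the '+' line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("no '+' line in record %r" % title)
        seq = "".join(seq_parts)

        # Read quality characters until there are as many as sequence letters.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("truncated quality in record %r" % title)
            chunk = line.strip()
            qual_parts.append(chunk)
            qual_len += len(chunk)
        if qual_len != len(seq):
            raise ValueError("quality longer than sequence in %r" % title)
        yield title, seq, "".join(qual_parts)

        # The next non-blank line is checked for '@' at the top of the loop.
        line = handle.readline()
        while line and not line.strip():
            line = handle.readline()


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n"
    for record in parse_fastq(io.StringIO(example)):
        print(record)
```

Counting quality characters rather than looking for the next '@' is what
lets wrapped records with a quality line beginning with '@' parse correctly.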
> ...
>>> We gave up on trying to guess the quality score standard and require
>>> users to say whether they are sanger, solexa (1.0) or Illumina (1.3)
>>> format files. If we only want the sequence then we don't care so we allow
>>> "fastq" as a sequence format and ignore the quality scores in that case.
>>
>> What format names have you used? Ideally we'd have the same names
>> in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and
>> "fastq-illumina").
>
> We don't normally use '-' in our format names so we have fastqsanger,
> fastqsolexa, fastqillumina and fastqint. None of these have been tried
> on users as yet.
>
> The '-' names look nice though. We can consider introducing them. Do you
> have a full list of format names (sequence, feature, alignment, etc.) we
> can try to conform to?

We (BioPerl) are using Biopython's "format-variant" naming convention
(e.g. "fastq-solexa"), or at least that's how I'm coding it up. With SeqIO
it's fairly easy to check the format name for a variant before loading the
parser class, and then pass the variant in as a second named parameter.
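
To make the idea concrete, here is a small sketch of splitting a
"format-variant" name and using the variant to pick a quality offset. It is
in Python for illustration only and is not the SeqIO code; split_format,
decode_quality and the OFFSETS table are hypothetical names. Treating a bare
"fastq" as the sanger variant is an assumption borrowed from Biopython's
naming. The offsets themselves are the standard ones: 33 for sanger, 64 for
solexa and Illumina 1.3.

```python
# ASCII offsets for the three FASTQ variants named in this thread.
# "sanger" and "illumina" encode Phred scores; "solexa" encodes Solexa
# scores, which are on a different scale and are returned unconverted here.
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def split_format(name, default_variant="sanger"):
    """Split 'fastq-solexa' into ('fastq', 'solexa').

    A name with no '-' gets default_variant (an assumption, see above).
    """
    fmt, sep, variant = name.lower().partition("-")
    return fmt, (variant if sep else default_variant)


def decode_quality(qual, variant):
    """Turn a quality string into integer scores for the given variant."""
    try:
        offset = OFFSETS[variant]
    except KeyError:
        raise ValueError("unknown fastq variant: %r" % variant)
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    fmt, variant = split_format("fastq-illumina")
    print(fmt, variant, decode_quality("hhh@", variant))
```

Splitting the name up front means the parser only ever needs to know about
"fastq" plus a variant, so further variants can be added to the table
without new format names.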
I have actually thought of adding in fastqint as an option (it would
be fairly easy to do).
chris