[Bioperl-l] Next Gen Formats

Fri Mar 12 08:09:40 EST 2010

Here is an example of a color-space sequence:

In one file (something.csfasta):

 >1_30_226_F3
T210320010.200.03.0110320320220212200122200.2220200
 >1_30_252_F3
T322220212.133.00.2202322132022202221002011.0011020

A '.' means the color at that position could not be called.

In another file (something.qual):

 >1_30_226_F3
4 4 27 17 31 7 24 26 13 -1 10 25 14 -1 26 4 -1 19 9 5 6 14 12 6 9 4 4 7 
7 20 4 4 19 12 12 4 4 12 10 10 5 4 -1 13 16 8 4 15 4 4
 >1_30_252_F3
18 4 19 15 9 4 4 5 4 -1 6 4 5 -1 5 6 -1 9 6 4 4 4 6 4 4 4 4 5 8 4 8 7 4 
7 5 4 4 10 9 12 8 4 -1 6 5 5 4 10 4 12

A quality value of -1 marks a position where the color could not be called; it lines up with a '.' at the same position in the csfasta read.
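
To make the pairing concrete, here is a rough sketch in plain Perl (no BioPerl) of how the two files could be read side by side. It assumes both files list the reads in the same order, that the first character of each read is the primer base, and that quality lines may wrap. The names read_records and pair_reads are made up for this example, and the demo data is just the first ten colors of the first read above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read FASTA-like records from a filehandle into [id, body] pairs.
# Continuation lines are joined with a space, so wrapped quality
# lines end up as one whitespace-separated string.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;          # trim both ends
        next if $line eq '';              # skip blank lines
        if ($line =~ /^>(\S+)/) {
            push @records, [$1, ''];      # start a new record
        }
        elsif (@records) {
            my $sep = $records[-1][1] eq '' ? '' : ' ';
            $records[-1][1] .= $sep . $line;
        }
    }
    return @records;
}

# Pair each color-space read with its quality scores.
# Assumption: both files contain the same reads in the same order.
# Everything is held in memory, which is fine for a sketch only.
sub pair_reads {
    my ($seq_fh, $qual_fh) = @_;
    my @seqs  = read_records($seq_fh);
    my @quals = read_records($qual_fh);
    die "record counts differ\n" unless @seqs == @quals;

    my @reads;
    for my $i (0 .. $#seqs) {
        my ($id,  $seq)  = @{ $seqs[$i] };
        my ($qid, $qual) = @{ $quals[$i] };
        die "ID mismatch: $id vs $qid\n" unless $id eq $qid;

        $seq =~ s/\s+//g;                         # undo any line wrapping
        my $primer = substr($seq, 0, 1);          # leading primer base, e.g. 'T'
        my @colors = split //, substr($seq, 1);   # one entry per color call
        my @scores = split ' ', $qual;            # one score per color call
        die "$id: color and score counts differ\n" unless @colors == @scores;

        push @reads, {
            id     => $id,
            primer => $primer,
            colors => \@colors,
            quals  => \@scores,
        };
    }
    return @reads;
}

# Demo on in-memory data (truncated from the first read above).
my $csfasta = ">1_30_226_F3\nT210320010.\n";
my $qual    = ">1_30_226_F3\n4 4 27 17 31 7 24\n26 13 -1\n";
open my $sfh, '<', \$csfasta or die $!;
open my $qfh, '<', \$qual    or die $!;

my @reads = pair_reads($sfh, $qfh);
for my $r (@reads) {
    # Indices (0-based) where the color could not be called.
    my @uncalled = grep { $r->{colors}[$_] eq '.' } 0 .. $#{ $r->{colors} };
    print "$r->{id}: uncalled at index @uncalled\n";
}
```

For real SOLiD output you would want to stream record by record rather than slurp both files, but the bookkeeping is the same: one score per color, with the primer base carrying no score.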

Chris Fields wrote:
> On Mar 12, 2010, at 4:06 AM, Peter wrote:
> 
>> On Fri, Mar 12, 2010 at 3:35 AM, Chris Fields <cjfields at illinois.edu> wrote:
>>> Ryan,
>>>
>>> We would have to see example files to get an idea of how feasible it is.
>>>  You could possibly use a Bio::SeqIO::fasta and a Bio::SeqIO::qual
>>> stream, and interleave the two somehow.  However, BioPerl qual
>>> scores are PHRED-based by default, and I'm not sure how color-space
>>> data would work within that schematic.
>>>
>>> chris
>> Chris,
>>
>> I am under the (possibly mistaken) assumption that PHRED scores
>> are used for SOLiD color space QUAL files - the key issue is each
>> score corresponds to the color call in the color sequence.
>>
>> Ignoring color-space for a moment, are there BioPerl examples
>> of iterating over a pair of sequence-space FASTA and QUAL files?
>> i.e. What you'd get if you had a FASTQ file to iterate over.
>>
>> [I guess Ryan could just merge the color-space FASTA and
>> QUAL into a color-space FASTQ file and iterate over that]
>>
>> Peter
> 
> If they're PHRED scores then it should be fine, though we may need to work in a few color-space specific things.
> 
> Iterating over pairs is something that has popped up before.  For output, in the Bio::SeqIO::fastq module there is code for writing fasta/qual (to two separate streams), where I'm assuming one could do something like:
> 
> --------------------------------
> use Bio::SeqIO;
>
> my $in = Bio::SeqIO->new(-format => 'fastq', -file => 'foo.fastq');
> my $out1 = Bio::SeqIO->new(-format => 'fastq', -file => '>foo.fasta');
> my $out2 = Bio::SeqIO->new(-format => 'fastq', -file => '>foo.qual');
>
> while (my $seq = $in->next_seq) {
>     $out1->write_fasta($seq);   # sequence only
>     $out2->write_qual($seq);    # quality scores only
> }
> --------------------------------
> 
> Note that all three streams use the 'fastq' format, instead of 'fasta' or 'qual'.  This should work for those as well; I just haven't tried it myself (it's a bug otherwise).
> 
> I'm assuming for input it would be something like:
> 
> --------------------------------
> my $in1 = Bio::SeqIO->new(-format => 'fasta', -file => 'foo.fasta');
> my $in2 = Bio::SeqIO->new(-format => 'qual', -file => 'foo.qual'); 
> my $out = Bio::SeqIO->new(-format => 'fastq', -file => '>foo.fastq'); 
> 
> # 'qual' parser joins the two streams
> while (my $seq = $in2->next_seq($in1)) {
>     $out->write_seq($seq);
> }
> --------------------------------
> 
> chris
> 
>