[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS

Fri Jul 24 18:38:51 UTC 2009

On Jul 24, 2009, at 8:32 AM, Peter wrote:

> Hi all,
>
> Peter Rice kindly said he will look into an OBF cross project mailing
> list, but in the meantime this has been cross posted to the Biopython,
> BioPerl, and EMBOSS development lists.

That's a great idea!  Would help cut down on the cross-posting (I'm  
getting this directly and via bioperl and biopython).

> On Thu, Jul 23, 2009 at 11:58 PM, Chris  
> Fields<cjfields at illinois.edu> wrote:
>>> I'd like to get comparisons against BioPerl's new FASTQ support
>>> going too. To do this I'd need to know which (branch?) of BioPerl I
>>> should install, and I'd also like a trivial sample BioPerl script  
>>> to do
>>> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
>>> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
>>> the Sanger Standard FASTQ).
>>
>> You would have to install svn (bioperl-live) if you want the  
>> refactored
>> fastq.  That commit was within the last month.
>
> I've got SVN bioperl-live installed and apparently working :)
>
>>> i.e. Something like this four line Biopython script would be  
>>> perfect:
>>> http://biopython.org/wiki/Reading_from_unix_pipes
>>
>> We use named parameters so it's a little more verbose.
>>
>> use Bio::SeqIO;
>> my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
>> my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
>> while (my $seq = $in->next_seq) { $out->write_seq($seq) }
>>
>> Don't be surprised if there are still bugs lurking about, just let  
>> me know
>> and I'll fix 'em.
>
> I've got a bug report coming up in a second email, but the basics  
> work :)
>
> e.g. Using this Sanger style FASTQ file, and converting it to Solexa  
> style
> http://biopython.org/SRC/biopython/Tests/Quality/example.fastq
>
> $ more example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +
> ;;3;;;;;;;;;;;;7;;;;;;;88
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +
> ;;;;;;;;;;;7;;;;;-;;;3;83
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +
> ;;;;;;;;;;;9;7;;.7;393333
>
> This is simple three record FASTQ file (in the Sanger format).
>
> Using EMBOSS 6.1.0:
>
> $ seqret -filter -sformat fastq-sanger -osformat fastq-solexa <  
> example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ZZRZZZZZZZZZZZZVZZZZZZZWW
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +EAS54_6_R1_2_1_540_792
> ZZZZZZZZZZZVZZZZZLZZZRZWR
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +EAS54_6_R1_2_1_443_348
> ZZZZZZZZZZZXZVZZMVZRXRRRR
>
> Using BioPerl:
>
> $ perl bioperl_sanger2solexa.pl < example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ZZRZZZZZZZZZZZZVZZZZZZZWW
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +EAS54_6_R1_2_1_540_792
> ZZZZZZZZZZZVZZZZZLZZZRZWR
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +EAS54_6_R1_2_1_443_348
> ZZZZZZZZZZZXZVZZMVZRXRRRR
>
> Using Biopython:
>
> $ python biopython_sanger2solexa.py < example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +
> ZZRZZZZZZZZZZZZVZZZZZZZWW
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +
> ZZZZZZZZZZZVZZZZZLZZZRZWR
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +
> ZZZZZZZZZZZXZVZZMVZRXRRRR
>
> They all agree, except that Biopython has followed the MAQ
> convention of omitting the (optional) repeat of the captions
> on the plus lines. This is something I'd already asked Peter
> Rice about for EMBOSS (but I think we got sidetracked):
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html
>
> Peter

Good to know the conversion is working, I was basically writing some  
code in the dark on that (and keeping my fingers crossed ;)

As for the optional header, we could add a flag to allow the user the  
option of printing it or not.  Would be easy enough; we can follow  
your lead as to what the default behavior is.  I'll take a look at the  
bug and try to get it into the next point release, hopefully not be  
anything too hard to fix.

chris