[Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
Chris Fields
cjfields at illinois.edu
Fri Jul 24 14:38:51 EDT 2009
On Jul 24, 2009, at 8:32 AM, Peter wrote:
> Hi all,
>
> Peter Rice kindly said he will look into an OBF cross project mailing
> list, but in the meantime this has been cross posted to the Biopython,
> BioPerl, and EMBOSS development lists.
That's a great idea! Would help cut down on the cross-posting (I'm
getting this directly and via bioperl and biopython).
> On Thu, Jul 23, 2009 at 11:58 PM, Chris
> Fields<cjfields at illinois.edu> wrote:
>>> I'd like to get comparisons against BioPerl's new FASTQ support
>>> going too. To do this I'd need to know which (branch?) of BioPerl I
>>> should install, and I'd also like a trivial sample BioPerl script
>>> to do
>>> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
>>> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
>>> the Sanger Standard FASTQ).
>>
>> You would have to install svn (bioperl-live) if you want the
>> refactored
>> fastq. That commit was within the last month.
>
> I've got SVN bioperl-live installed and apparently working :)
>
>>> i.e. Something like this four line Biopython script would be
>>> perfect:
>>> http://biopython.org/wiki/Reading_from_unix_pipes
>>
>> We use named parameters so it's a little more verbose.
>>
>> use Bio::SeqIO;
>> my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
>> my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
>> while (my $seq = $in->next_seq) { $out->write_seq($seq) }
>>
>> Don't be surprised if there are still bugs lurking about, just let
>> me know
>> and I'll fix 'em.
>
> I've got a bug report coming up in a second email, but the basics
> work :)
>
> e.g. Using this Sanger style FASTQ file, and converting it to Solexa
> style
> http://biopython.org/SRC/biopython/Tests/Quality/example.fastq
>
> $ more example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +
> ;;3;;;;;;;;;;;;7;;;;;;;88
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +
> ;;;;;;;;;;;7;;;;;-;;;3;83
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +
> ;;;;;;;;;;;9;7;;.7;393333
>
> This is simple three record FASTQ file (in the Sanger format).
>
> Using EMBOSS 6.1.0:
>
> $ seqret -filter -sformat fastq-sanger -osformat fastq-solexa <
> example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ZZRZZZZZZZZZZZZVZZZZZZZWW
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +EAS54_6_R1_2_1_540_792
> ZZZZZZZZZZZVZZZZZLZZZRZWR
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +EAS54_6_R1_2_1_443_348
> ZZZZZZZZZZZXZVZZMVZRXRRRR
>
> Using BioPerl:
>
> $ perl bioperl_sanger2solexa.pl < example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ZZRZZZZZZZZZZZZVZZZZZZZWW
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +EAS54_6_R1_2_1_540_792
> ZZZZZZZZZZZVZZZZZLZZZRZWR
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +EAS54_6_R1_2_1_443_348
> ZZZZZZZZZZZXZVZZMVZRXRRRR
>
> Using Biopython:
>
> $ python biopython_sanger2solexa.py < example.fastq
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +
> ZZRZZZZZZZZZZZZVZZZZZZZWW
> @EAS54_6_R1_2_1_540_792
> TTGGCAGGCCAAGGCCGATGGATCA
> +
> ZZZZZZZZZZZVZZZZZLZZZRZWR
> @EAS54_6_R1_2_1_443_348
> GTTGCTTCTGGCGTGGGTGGGGGGG
> +
> ZZZZZZZZZZZXZVZZMVZRXRRRR
>
> They all agree, except that Biopython has followed the MAQ
> convention of omitting the (optional) repeat of the captions
> on the plus lines. This is something I'd already asked Peter
> Rice about for EMBOSS (but I think we got sidetracked):
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html
>
> Peter
Good to know the conversion is working, I was basically writing some
code in the dark on that (and keeping my fingers crossed ;)
As for the optional header, we could add a flag to allow the user the
option of printing it or not. Would be easy enough; we can follow
your lead as to what the default behavior is. I'll take a look at the
bug and try to get it into the next point release, hopefully not be
anything too hard to fix.
chris
More information about the Bioperl-l
mailing list