[emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS

Peter biopython at maubp.freeserve.co.uk
Fri Jul 24 13:32:49 UTC 2009


Hi all,

Peter Rice kindly said he will look into setting up an OBF cross-project
mailing list, but in the meantime this has been cross-posted to the
Biopython, BioPerl, and EMBOSS development lists.

On Thu, Jul 23, 2009 at 11:58 PM, Chris Fields <cjfields at illinois.edu> wrote:
>> I'd like to get comparisons against BioPerl's new FASTQ support
>> going too. To do this I'd need to know which (branch?) of BioPerl I
>> should install, and I'd also like a trivial sample BioPerl script to do
>> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
>> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
>> the Sanger Standard FASTQ).
>
> You would have to install svn (bioperl-live) if you want the refactored
> fastq.  That commit was within the last month.

I've got SVN bioperl-live installed and apparently working :)

>> i.e. Something like this four line Biopython script would be perfect:
>> http://biopython.org/wiki/Reading_from_unix_pipes
>
> We use named parameters so it's a little more verbose.
>
> use Bio::SeqIO;
> my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
> my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
> while (my $seq = $in->next_seq) { $out->write_seq($seq) }
>
> Don't be surprised if there are still bugs lurking about, just let me know
> and I'll fix 'em.

I've got a bug report coming up in a second email, but the basics work :)

For example, take this Sanger style FASTQ file and convert it to Solexa style:
http://biopython.org/SRC/biopython/Tests/Quality/example.fastq

$ more example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333

This is a simple three-record FASTQ file (in the Sanger format).

Using EMBOSS 6.1.0:

$ seqret -filter -sformat fastq-sanger -osformat fastq-solexa < example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
ZZRZZZZZZZZZZZZVZZZZZZZWW
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
ZZZZZZZZZZZVZZZZZLZZZRZWR
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
ZZZZZZZZZZZXZVZZMVZRXRRRR

Using BioPerl:

$ perl bioperl_sanger2solexa.pl < example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
ZZRZZZZZZZZZZZZVZZZZZZZWW
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
ZZZZZZZZZZZVZZZZZLZZZRZWR
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
ZZZZZZZZZZZXZVZZMVZRXRRRR

Using Biopython:

$ python biopython_sanger2solexa.py < example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
ZZRZZZZZZZZZZZZVZZZZZZZWW
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
ZZZZZZZZZZZVZZZZZLZZZRZWR
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR
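
As a sanity check on the quality strings above, here is a minimal sketch in
plain Python of the Sanger to Solexa mapping. It is not the code used by any
of the three toolkits. It assumes the usual definitions: Sanger FASTQ stores
PHRED scores at ASCII offset 33, Solexa FASTQ stores Solexa scores at ASCII
offset 64, and Q_solexa = 10 * log10(10^(Q_phred / 10) - 1), rounded to the
nearest integer. The floor of -5 for very low scores and the function names
are assumptions made for illustration.

```python
import math

SANGER_OFFSET = 33  # Sanger FASTQ: PHRED scores, ASCII offset 33
SOLEXA_OFFSET = 64  # Solexa FASTQ: Solexa scores, ASCII offset 64


def phred_to_solexa(phred):
    """Map an integer PHRED score to the nearest integer Solexa score."""
    if phred <= 0:
        # Assumption: PHRED 0 has no finite Solexa equivalent,
        # so fall back to a floor of -5.
        return -5
    solexa = 10 * math.log10(10 ** (phred / 10.0) - 1)
    # Assumption: clamp very low scores to the same floor of -5.
    return max(-5, int(round(solexa)))


def sanger_to_solexa(quality_line):
    """Re-encode one Sanger FASTQ quality line as a Solexa quality line."""
    return "".join(
        chr(phred_to_solexa(ord(char) - SANGER_OFFSET) + SOLEXA_OFFSET)
        for char in quality_line
    )


# First record of example.fastq
print(sanger_to_solexa(";;3;;;;;;;;;;;;7;;;;;;;88"))
# prints ZZRZZZZZZZZZZZZVZZZZZZZWW
```

For instance ';' is ASCII 59, so PHRED 26, which maps to Solexa 26 and hence
'Z' (ASCII 90).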

They all agree, except that Biopython has followed the MAQ
convention of omitting the (optional) repeat of the title
on the plus lines. This is something I'd already asked Peter
Rice about for EMBOSS (but I think we got sidetracked):
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html
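
To compare output from the different tools without tripping over this
difference, one option is to blank the plus lines before diffing. The
following is a rough sketch in plain Python, not part of any of the three
toolkits. The function name is hypothetical, and it assumes simple
four-lines-per-record FASTQ with no line wrapping. It works by line position
rather than by looking for a leading "+", because a quality line may itself
start with "+" or "@".

```python
def strip_plus_titles(fastq_text):
    """Blank the optional repeated title on every '+' line.

    Assumes unwrapped FASTQ, i.e. exactly four lines per record.
    """
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("Expected four lines per FASTQ record")
    # The '+' line is the third line (index 2) of each record.
    for index in range(2, len(lines), 4):
        if not lines[index].startswith("+"):
            raise ValueError("Line %i does not start with '+'" % (index + 1))
        lines[index] = "+"
    return "\n".join(lines) + "\n"


# First record as written with the repeated title (EMBOSS, BioPerl) ...
with_title = (
    "@EAS54_6_R1_2_1_413_324\n"
    "CCCTTCTTGTCTTCAGCGTTTCTCC\n"
    "+EAS54_6_R1_2_1_413_324\n"
    "ZZRZZZZZZZZZZZZVZZZZZZZWW\n"
)
# ... and without it (Biopython, MAQ convention).
without_title = (
    "@EAS54_6_R1_2_1_413_324\n"
    "CCCTTCTTGTCTTCAGCGTTTCTCC\n"
    "+\n"
    "ZZRZZZZZZZZZZZZVZZZZZZZWW\n"
)

print(strip_plus_titles(with_title) == without_title)
# prints True
```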

Peter



