[Bioperl-l] fastq splitter

Wed Feb 29 09:38:50 EST 2012

On Feb 29, 2012, at 7:07 AM, Michael Muratet wrote:

> On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
> 
>> Hi Chris,
>> Unfortunately the read pairs are not consecutive. It seems they are cat'd
>> together.
>> I could use split -l at the line number where they're glued together, I guess.
>> If this is an overnight job for a bunch of files, I can wait so don't mind
>> using the module if it worked.
>> 
>> Someone pointed out I need to switch $seqin->desc to $inseq->desc.
>> However, now it spits out fasta output instead of fastq and returns a bunch
>> of warnings: Seq/Qual descriptions don't match; using sequence description
> Hi Sean
> Apparently the BioPerl parser expects the 'second' header line, i.e.,
> 
> @first_header
> sequence
> +second_header
> quality_scores
> 
> to have the same (redundant) identifier. When it encounters a bare '+' line with no identifier, which is the way the Illumina pipeline writes it out, it warns you.
> 
> I think you have to explicitly write out the quality scores in fastq format.
> 
> Cheers
> 
> Mike

Actually no, that's not true for the latest versions.  The FASTQ parser was completely refactored in coordination with Peter Cock (Biopython), the other Bio* toolkits, and EMBOSS to parse a wide range of FASTQ data (including the Solexa/Illumina variants) and to catch bad formatting.  See this pub:

http://www.ncbi.nlm.nih.gov/pubmed/20015970
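
To make the variant differences concrete, here is a rough Python sketch (not the BioPerl code) of how the quality characters decode under each variant. The function name and the variant labels are my own for illustration. It assumes the usual conventions: Sanger uses an ASCII offset of 33 with Phred scores, Illumina 1.3+ uses offset 64 with Phred scores, and Solexa uses offset 64 with log-odds scores that need converting to Phred.

```python
import math

# ASCII offsets for the common FASTQ variants.
# The keys are illustrative labels, not identifiers from any toolkit.
OFFSETS = {"sanger": 33, "illumina": 64, "solexa": 64}


def decode_quality(qual, variant="sanger"):
    """Return a list of Phred scores for one FASTQ quality string."""
    offset = OFFSETS[variant]
    scores = [ord(ch) - offset for ch in qual]
    if variant == "solexa":
        # Solexa scores are log-odds based; map them onto the Phred scale.
        scores = [round(10 * math.log10(10 ** (q / 10) + 1)) for q in scores]
    return scores


if __name__ == "__main__":
    # ';' is ASCII 59 and '3' is ASCII 51, read here as Sanger-encoded.
    print(decode_quality(";;3", "sanger"))
```

Reading a file with the wrong variant gives scores that are off by 31 or slightly distorted at the low end, which is why the parser has to be told (or has to guess) which variant it is looking at.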

This is one of the primary test examples that passes:

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
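
For anyone who wants to sanity-check a file outside BioPerl, a minimal Python sketch of the same leniency follows. It is not the Bio::SeqIO implementation, and parse_fastq is a name I made up. It assumes strict four-line records (no wrapped sequence or quality lines) and accepts a '+' line that is either bare or repeats the identifier.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records.

    Assumes unwrapped records: header, sequence, '+' line, quality.
    """
    # Drop line endings and skip completely blank lines between records.
    cleaned = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("line count is not a multiple of four")

    # Walk in fixed chunks of four; quality lines may legally start with
    # '@', so scanning for '@' to find record starts would be unsafe.
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("header does not start with '@': " + header)
        if not plus.startswith("+"):
            raise ValueError("third line does not start with '+': " + plus)
        repeated = plus[1:]
        # A bare '+' is fine; a repeated header must match the first one.
        if repeated and repeated != header[1:]:
            raise ValueError("'+' line does not match header: " + plus)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for ident, seq, qual in parse_fastq(handle):
            print(ident, len(seq))
```

Run against a file like the example above, it should list each identifier with its read length, and raise on the first record that is malformed.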

chris