[Bioperl-l] fastq splitter
Fields, Christopher J
cjfields at illinois.edu
Wed Feb 29 09:38:50 EST 2012
On Feb 29, 2012, at 7:07 AM, Michael Muratet wrote:
> On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
>
>> Hi Chris,
>> Unfortunately the read pairs are not consecutive. It seems they are cat'd
>> together.
>> I could use split -l on the line number that they're glued together I guess.
>> If this is an overnight job for a bunch of files, I can wait so don't mind
>> using the module if it worked.
>>
>> Someone pointed out I need to switch $seqin->desc to $inseq->desc.
>> However, now it spits out fasta output instead of fastq and returns a bunch
>> of warnings: Seq/Qual descriptions don't match; using sequence description
> Hi Sean
> Apparently the bioperl parser expects the 'second' header line, i.e.,
>
> @first_header
> sequence
> +second_header
> quality_scores
>
> to have the same (redundant) identifier. When it encounters a blank line, which is the way the Illumina pipeline writes it out, it warns you.
>
> I think you have to explicitly write out the quality scores in fastq format.
>
> Cheers
>
> Mike
Actually no, that's not true for the latest versions. The FASTQ parser was completely refactored in coordination with Peter Cock (Biopython), the other Bio* toolkits, and EMBOSS to parse a wide range of FASTQ data (including the Solexa/Illumina variants) and to attempt to catch bad formatting. See this pub:
http://www.ncbi.nlm.nih.gov/pubmed/20015970
This is one of the primary test examples that passes:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333
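
To illustrate the rule (a bare '+' line is fine; a '+' line that repeats an identifier should match the '@' line), here is a minimal sketch in plain Python. It is not the BioPerl implementation. It assumes strict four-line records with no line wrapping, and the function name read_fastq is made up for the example. The second record has been altered to repeat its identifier on the '+' line so that both accepted forms are shown.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from four-line FASTQ records.

    Assumption: sequence and quality each occupy exactly one line.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")

        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: %r" % header)
        if not plus.startswith("+"):
            raise ValueError("expected '+' line, got: %r" % plus)

        name = header[1:]
        # The '+' line may be bare or may repeat the header text; anything else is an error.
        if len(plus) > 1 and plus[1:] != name:
            raise ValueError("'+' line does not match header for %s" % name)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch for %s" % name)

        yield name, seq, qual


# First record has a bare '+'; second repeats the identifier (altered for illustration).
EXAMPLE = """\
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
;;;;;;;;;;;7;;;;;-;;;3;83
"""

records = list(read_fastq(io.StringIO(EXAMPLE)))
for name, seq, qual in records:
    print(name, seq, qual)
```

Both records parse without complaint; only a '+' line carrying a different identifier, or a quality string whose length differs from the sequence, raises an error in this sketch.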
chris