[Bioperl-l] fastq splitter

Michael Muratet mmuratet at hudsonalpha.org
Wed Feb 29 08:07:48 EST 2012


On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:

> Hi Chris,
> Unfortunately the read pairs are not consecutive. It seems they are  
> cat'd
> together.
> I could use split -l on the line number that they're glued together  
> I guess.
> If this is an overnight job for a bunch of files, I can wait so  
> don't mind
> using the module if it worked.
>
> Someone pointed out I need to switch $seqin->desc to $inseq->desc.
> However, now it spits out fasta output instead of fastq and returns  
> a bunch
> of warnings: Seq/Qual descriptions don't match; using sequence  
> description
Hi Sean
Apparently the bioperl parser expects the the 'second' header line,  
i.e.,

@first_header
sequence
+second_header
quality_scores

to have the same (redundant) identifier. When it encounters a blank  
line, which is the way the Illumina pipeline writes it out, it warns  
you.

I think you have to explicitly write out the quality scores in fastq  
format.

Cheers

Mike

>
> Hmm.
>
> On 28 February 2012 16:50, Fields, Christopher J <cjfields at illinois.edu 
> >wrote:
>
>> Sean,
>>
>> If you trust the data enough, in that:
>>
>> 1) each record is 4 lines,
>> 2) mate pairs are consecutive in the file, and
>> 3) that read 1 always preceeds read 2 in the pair,
>>
>> then I would simply iterate through 4 lines at a time and dump to  
>> the two
>> separate files, maybe using a flip-flop or simple record count and  
>> modulus
>> switch.  You can always run a check on the header with a regex if  
>> you don't
>> trust it completely.
>>
>> Just from the sanity point-of-view, unless you're doing a lot of
>> validation I wouldn't use Bio::SeqIO::fastq, unless you have some  
>> time on
>> your hands and a relatively low number of seqs (it's notoriously  
>> slow at
>> the moment).
>>
>> chris
>>
>> On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
>>
>>> Hi,
>>> I'm trying to write a quick script to separate one large PE fastq  
>>> file
>> into
>>> 2 separate files, one for each mate pair
>>>
>>> The file is of the format (mate1)
>>> @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
>>> CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
>>> +
>>> BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
>>>
>>> && (mate2)
>>>
>>> @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
>>> TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
>>> +
>>> ##################################################
>>>
>>>
>>> My idea is to separate using a regex such that / 1:/ would be the  
>>> first
>>> mate pair and / 2:/ would go in the second mate file.
>>> I implemented the code below but each output file is empty. Can  
>>> someone
>>> spot my error?
>>>
>>> Thanks,
>>> Sean.
>>>
>>> my $infile   = shift;
>>> my $outfile1 = $infile."_1";
>>> my $outfile2 = $infile."_2";
>>>
>>> my $seqin = Bio::SeqIO->new(
>>>                            -file   => "<$infile",
>>>                            -format => "fastq",
>>>                            );
>>> my $seqout1 = Bio::SeqIO->new(
>>>                             -file   => ">$outfile1",
>>>                             -format => "fastq",
>>>                             );
>>>
>>> my $seqout2 = Bio::SeqIO->new(
>>>                             -file   => ">$outfile2",
>>>                             -format => "fastq",
>>>                             );
>>> while (my $inseq = $seqin->next_seq) {
>>>   if ($seqin->desc =~ / 1:/){
>>>     $seqout1->write_seq($inseq);
>>>   } else {
>>>     $seqout2->write_seq($inseq);
>>>   }
>>> }
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Michael Muratet, Ph.D.
Senior Scientist
HudsonAlpha Institute for Biotechnology
mmuratet at hudsonalpha.org
(256) 327-0473 (p)
(256) 327-0966 (f)

Room 4005
601 Genome Way
Huntsville, Alabama 35806








More information about the Bioperl-l mailing list