[Bioperl-l] fastq splitter
Michael Muratet
mmuratet at hudsonalpha.org
Wed Feb 29 08:07:48 EST 2012
On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
> Hi Chris,
> Unfortunately the read pairs are not consecutive. It seems they are
> cat'd
> together.
> I could use split -l on the line number that they're glued together
> I guess.
> If this is an overnight job for a bunch of files, I can wait so
> don't mind
> using the module if it worked.
>
> Someone pointed out I need to switch $seqin->desc to $inseq->desc.
> However, now it spits out fasta output instead of fastq and returns
> a bunch
> of warnings: Seq/Qual descriptions don't match; using sequence
> description
Hi Sean
Apparently the bioperl parser expects the the 'second' header line,
i.e.,
@first_header
sequence
+second_header
quality_scores
to have the same (redundant) identifier. When it encounters a blank
line, which is the way the Illumina pipeline writes it out, it warns
you.
I think you have to explicitly write out the quality scores in fastq
format.
Cheers
Mike
>
> Hmm.
>
> On 28 February 2012 16:50, Fields, Christopher J <cjfields at illinois.edu
> >wrote:
>
>> Sean,
>>
>> If you trust the data enough, in that:
>>
>> 1) each record is 4 lines,
>> 2) mate pairs are consecutive in the file, and
>> 3) that read 1 always preceeds read 2 in the pair,
>>
>> then I would simply iterate through 4 lines at a time and dump to
>> the two
>> separate files, maybe using a flip-flop or simple record count and
>> modulus
>> switch. You can always run a check on the header with a regex if
>> you don't
>> trust it completely.
>>
>> Just from the sanity point-of-view, unless you're doing a lot of
>> validation I wouldn't use Bio::SeqIO::fastq, unless you have some
>> time on
>> your hands and a relatively low number of seqs (it's notoriously
>> slow at
>> the moment).
>>
>> chris
>>
>> On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
>>
>>> Hi,
>>> I'm trying to write a quick script to separate one large PE fastq
>>> file
>> into
>>> 2 separate files, one for each mate pair
>>>
>>> The file is of the format (mate1)
>>> @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
>>> CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
>>> +
>>> BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
>>>
>>> && (mate2)
>>>
>>> @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
>>> TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
>>> +
>>> ##################################################
>>>
>>>
>>> My idea is to separate using a regex such that / 1:/ would be the
>>> first
>>> mate pair and / 2:/ would go in the second mate file.
>>> I implemented the code below but each output file is empty. Can
>>> someone
>>> spot my error?
>>>
>>> Thanks,
>>> Sean.
>>>
>>> my $infile = shift;
>>> my $outfile1 = $infile."_1";
>>> my $outfile2 = $infile."_2";
>>>
>>> my $seqin = Bio::SeqIO->new(
>>> -file => "<$infile",
>>> -format => "fastq",
>>> );
>>> my $seqout1 = Bio::SeqIO->new(
>>> -file => ">$outfile1",
>>> -format => "fastq",
>>> );
>>>
>>> my $seqout2 = Bio::SeqIO->new(
>>> -file => ">$outfile2",
>>> -format => "fastq",
>>> );
>>> while (my $inseq = $seqin->next_seq) {
>>> if ($seqin->desc =~ / 1:/){
>>> $seqout1->write_seq($inseq);
>>> } else {
>>> $seqout2->write_seq($inseq);
>>> }
>>> }
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Michael Muratet, Ph.D.
Senior Scientist
HudsonAlpha Institute for Biotechnology
mmuratet at hudsonalpha.org
(256) 327-0473 (p)
(256) 327-0966 (f)
Room 4005
601 Genome Way
Huntsville, Alabama 35806
More information about the Bioperl-l
mailing list