[Bioperl-l] fastq splitter

Tue Feb 28 18:50:25 EST 2012

$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.006001

Isn't that 1.6.1 - does it need upgrading ?

On 28 February 2012 18:36, Sean O'Keeffe <limericksean at gmail.com> wrote:

> Could be. I'll check.
>
> On 28 February 2012 17:17, Fields, Christopher J <cjfields at illinois.edu>wrote:
>
>> That's a bit odd.  Are you using an old version of the FASTQ parser?  It
>> was revised a while ago, prior to the v1.6.1 release (the error matches one
>> in the older parser)
>>
>> chris
>>
>> On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
>>
>> > Hi Chris,
>> > Unfortunately the read pairs are not consecutive. It seems they are
>> cat'd together.
>> > I could use split -l on the line number that they're glued together I
>> guess.
>> > If this is an overnight job for a bunch of files, I can wait so don't
>> mind using the module if it worked.
>> >
>> > Someone pointed out I need to switch $seqin->desc to $inseq->desc.
>> > However, now it spits out fasta output instead of fastq and returns a
>> bunch of warnings: Seq/Qual descriptions don't match; using sequence
>> description
>> >
>> > Hmm.
>> >
>> > On 28 February 2012 16:50, Fields, Christopher J <cjfields at illinois.edu>
>> wrote:
>> > Sean,
>> >
>> > If you trust the data enough, in that:
>> >
>> > 1) each record is 4 lines,
>> > 2) mate pairs are consecutive in the file, and
>> > 3) that read 1 always preceeds read 2 in the pair,
>> >
>> > then I would simply iterate through 4 lines at a time and dump to the
>> two separate files, maybe using a flip-flop or simple record count and
>> modulus switch.  You can always run a check on the header with a regex if
>> you don't trust it completely.
>> >
>> > Just from the sanity point-of-view, unless you're doing a lot of
>> validation I wouldn't use Bio::SeqIO::fastq, unless you have some time on
>> your hands and a relatively low number of seqs (it's notoriously slow at
>> the moment).
>> >
>> > chris
>> >
>> > On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
>> >
>> > > Hi,
>> > > I'm trying to write a quick script to separate one large PE fastq
>> file into
>> > > 2 separate files, one for each mate pair
>> > >
>> > > The file is of the format (mate1)
>> > > @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
>> > > CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
>> > > +
>> > > BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
>> > >
>> > > && (mate2)
>> > >
>> > > @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
>> > > TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
>> > > +
>> > > ##################################################
>> > >
>> > >
>> > > My idea is to separate using a regex such that / 1:/ would be the
>> first
>> > > mate pair and / 2:/ would go in the second mate file.
>> > > I implemented the code below but each output file is empty. Can
>> someone
>> > > spot my error?
>> > >
>> > > Thanks,
>> > > Sean.
>> > >
>> > > my $infile   = shift;
>> > > my $outfile1 = $infile."_1";
>> > > my $outfile2 = $infile."_2";
>> > >
>> > > my $seqin = Bio::SeqIO->new(
>> > >                             -file   => "<$infile",
>> > >                             -format => "fastq",
>> > >                             );
>> > > my $seqout1 = Bio::SeqIO->new(
>> > >                              -file   => ">$outfile1",
>> > >                              -format => "fastq",
>> > >                              );
>> > >
>> > > my $seqout2 = Bio::SeqIO->new(
>> > >                              -file   => ">$outfile2",
>> > >                              -format => "fastq",
>> > >                              );
>> > > while (my $inseq = $seqin->next_seq) {
>> > >    if ($seqin->desc =~ / 1:/){
>> > >      $seqout1->write_seq($inseq);
>> > >    } else {
>> > >      $seqout2->write_seq($inseq);
>> > >    }
>> > > }
>> > > _______________________________________________
>> > > Bioperl-l mailing list
>> > > Bioperl-l at lists.open-bio.org
>> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>> >
>>
>>
>