[Bioperl-l] fastq splitter

Wed Feb 29 11:30:05 EST 2012

Hi Chris,
Here's the perldoc for fastq - it does seem to indicate that the optional
descriptor (+) must match the first header. (See DESCRIPTION).
Maybe this version doesn't include the updated code you mention which Peter
Cock has worked on.

Sean.

======================================

Bio::SeqIO::fastq(3)  User Contributed Perl Documentation
Bio::SeqIO::fastq(3)

NAME
       Bio::SeqIO::fastq - fastq sequence input/output stream

SYNOPSIS
         ################## pertains to FASTQ parsing only
##################

         # grabs the FASTQ parser, specifies the Illumina variant
         my $in = Bio::SeqIO->new(-format    => ’fastq-illumina’,
                                  -file      => ’mydata.fq’);

         # simple ’fastq’ format defaults to ’sanger’ variant
         my $out = Bio::SeqIO->new(-format    => ’fastq’,
                                  -file      => ’>mydata.fq’);

         # $seq is a Bio::Seq::Quality object
         while (my $seq = $in->next_seq) {
             $out->write_seq($seq);  # convert Illumina 1.3 to Sanger format
         }

         # for 4x faster parsing, one can do something like this for raw
data
         use Bio::Seq::Quality;

         # $data is a hash reference containing all arguments to be passed
to
         # the Bio::Seq::Quality constructor
         while (my $data = $in->next_dataset) {
             # process $data, such as trim, etc
             my $seq = Bio::Seq::Quality->new(%$data);

             # for now, write_seq only accepts Bio::Seq::Quality, but may
be modified
             # to allow raw hash references for speed
             $out->write_seq($data);
         }

DESCRIPTION
       This object can transform Bio::Seq and Bio::Seq::Quality objects to
and from FASTQ flat file databases.

       FASTQ is a file format used frequently at the Sanger Centre and in
next-gen sequencing to bundle a FASTA sequence and its quality data. A
typical FASTQ entry takes the from:

         @HCDPQ1D0501
         GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT.....
         +HCDPQ1D0501
         !’’*((((***+))%%%++)(%%%%).1***-+*’’))**55CCF>>>>>>CCCCCCC65.....

       where:

         @ = descriptor, followed by one or more sequence lines
         + = optional descriptor (if present, must match first one),
followed by one or
             more qual lines

       FASTQ and Bio::Seq::Quality mapping

       FASTQ files have sequence and quality data on single line or
multiple lines, and the quality values are single-byte encoded. Data are
mapped very simply to Bio::Seq::Quality instances:

           Data                                        Bio::Seq::Quality
method

------------------------------------------------------------------------
           first non-whitespace chars in descriptor    id^
           descriptor line                             desc^
           sequence lines                              seq
           quality                                     qual*
           FASTQ variant                               namespace

           ^ first nonwhitespace chars are id(), everything else after (to
end of line)
             is in desc()
           * Converted to PHRED quality scores where applicable (’solexa’)

       FASTQ variants

       This parser supports all variants of FASTQ, including Illumina v 1.0
and 1.3:

           variant                note
           -----------------------------------------------------------
           sanger                 original
           solexa                 Solexa, Inc. (2004), aka Illumina 1.0
           illumina               Illumina 1.3

       The variant can be specified by passing by either passing the
additional -variant parameter to the constructor:

         my $in = Bio::SeqIO->new(-format    => ’fastq’,
                                  -variant   => ’solexa’,
                                  -file      => ’mysol.fq’);

       or by passing the format and variant together (Bio::SeqIO will now
handle this and convert it accordingly to the proper argument):

         my $in = Bio::SeqIO->new(-format    => ’fastq-solexa’,
                                  -file      => ’mysol.fq’);

       Variants can be converted back and forth from one another; however,
due to the difference in scaling for solexa quality reads, converting from
’illumina’ or ’sanger’ FASTQ to solexa is not recommended.
....

============================================

On 28 February 2012 21:40, Fields, Christopher J <cjfields at illinois.edu>wrote:

> That should work.  Can you send the output of 'perldoc Bio::SeqIO::fastq'?
>  That should indicate what is being called.
>
> chris
>
> On Feb 28, 2012, at 5:50 PM, Sean O'Keeffe wrote:
>
> > $ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
> > 1.006001
> >
> > Isn't that 1.6.1 - does it need upgrading ?
> >
> > On 28 February 2012 18:36, Sean O'Keeffe <limericksean at gmail.com> wrote:
> > Could be. I'll check.
> >
> > On 28 February 2012 17:17, Fields, Christopher J <cjfields at illinois.edu>
> wrote:
> > That's a bit odd.  Are you using an old version of the FASTQ parser?  It
> was revised a while ago, prior to the v1.6.1 release (the error matches one
> in the older parser)
> >
> > chris
> >
> > On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
> >
> > > Hi Chris,
> > > Unfortunately the read pairs are not consecutive. It seems they are
> cat'd together.
> > > I could use split -l on the line number that they're glued together I
> guess.
> > > If this is an overnight job for a bunch of files, I can wait so don't
> mind using the module if it worked.
> > >
> > > Someone pointed out I need to switch $seqin->desc to $inseq->desc.
> > > However, now it spits out fasta output instead of fastq and returns a
> bunch of warnings: Seq/Qual descriptions don't match; using sequence
> description
> > >
> > > Hmm.
> > >
> > > On 28 February 2012 16:50, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
> > > Sean,
> > >
> > > If you trust the data enough, in that:
> > >
> > > 1) each record is 4 lines,
> > > 2) mate pairs are consecutive in the file, and
> > > 3) that read 1 always preceeds read 2 in the pair,
> > >
> > > then I would simply iterate through 4 lines at a time and dump to the
> two separate files, maybe using a flip-flop or simple record count and
> modulus switch.  You can always run a check on the header with a regex if
> you don't trust it completely.
> > >
> > > Just from the sanity point-of-view, unless you're doing a lot of
> validation I wouldn't use Bio::SeqIO::fastq, unless you have some time on
> your hands and a relatively low number of seqs (it's notoriously slow at
> the moment).
> > >
> > > chris
> > >
> > > On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
> > >
> > > > Hi,
> > > > I'm trying to write a quick script to separate one large PE fastq
> file into
> > > > 2 separate files, one for each mate pair
> > > >
> > > > The file is of the format (mate1)
> > > > @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
> > > > CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
> > > > +
> > > > BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
> > > >
> > > > && (mate2)
> > > >
> > > > @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
> > > > TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
> > > > +
> > > > ##################################################
> > > >
> > > >
> > > > My idea is to separate using a regex such that / 1:/ would be the
> first
> > > > mate pair and / 2:/ would go in the second mate file.
> > > > I implemented the code below but each output file is empty. Can
> someone
> > > > spot my error?
> > > >
> > > > Thanks,
> > > > Sean.
> > > >
> > > > my $infile   = shift;
> > > > my $outfile1 = $infile."_1";
> > > > my $outfile2 = $infile."_2";
> > > >
> > > > my $seqin = Bio::SeqIO->new(
> > > >                             -file   => "<$infile",
> > > >                             -format => "fastq",
> > > >                             );
> > > > my $seqout1 = Bio::SeqIO->new(
> > > >                              -file   => ">$outfile1",
> > > >                              -format => "fastq",
> > > >                              );
> > > >
> > > > my $seqout2 = Bio::SeqIO->new(
> > > >                              -file   => ">$outfile2",
> > > >                              -format => "fastq",
> > > >                              );
> > > > while (my $inseq = $seqin->next_seq) {
> > > >    if ($seqin->desc =~ / 1:/){
> > > >      $seqout1->write_seq($inseq);
> > > >    } else {
> > > >      $seqout2->write_seq($inseq);
> > > >    }
> > > > }
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> > >
> >
> >
> >
>
>