[Bioperl-l] fastq splitter

Wed Feb 29 02:42:27 UTC 2012

Frankly, there never seemed to be a real fixed standard for how FASTQ headers were written (and just when it seems there is some consensus, Illumina pulls the rug out from under you), which is why I leave the ID alone.  We could add some ID munging in there if needed; it would just need a qr// with a standard fallback.
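
A minimal sketch of what such munging might look like, assuming only two header conventions: the newer Illumina style seen in this thread ("1:N:0:ATCACG" in the description) and the older "/1" or "/2" suffix on the read ID. The name mate_number and the pattern table are hypothetical and not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical pattern table: each entry names the header field to inspect
# and a precompiled qr// whose first capture group is the mate number.
my @MATE_PATTERNS = (
    { field => 'desc', re => qr/^([12]):[YN]:/ },   # e.g. "1:N:0:ATCACG"
    { field => 'id',   re => qr{/([12])$}      },   # e.g. "read_name/1"
);

# Return 1 or 2 for a recognised header, or undef as the standard fallback.
sub mate_number {
    my ($id, $desc) = @_;
    my %value = (id => $id // '', desc => $desc // '');
    for my $p (@MATE_PATTERNS) {
        return $1 if $value{ $p->{field} } =~ $p->{re};
    }
    return;    # unknown convention: the caller decides what to do
}
```

Supporting another header convention would then only mean adding one more entry to the table, and callers can treat an undef result as "leave the record alone".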

chris

On Feb 29, 2012, at 4:33 PM, Florent Angly wrote:

> Also, the desc() method returns the part after the whitespace in the FASTQ header.
> Hence your regular expression should not have the space: instead of / 1:/, it should be written /1:/. In fact, it would be even better (faster) if it were written as an anchored regular expression that matches only the beginning of the description, /^1:/
> 
> Note that you are apparently using the latest Illumina format, which does not follow the previous convention for paired-end read headers. Hence your script will not work properly with paired-end files in the older Illumina formats.
> 
> Florent
> 
> 
> 
> On 29/02/12 07:26, Michael Muratet wrote:
>> 
>> On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
>> 
>>> Hi,
>>> I'm trying to write a quick script to separate one large PE fastq file into
>>> 2 separate files, one for each mate pair
>>> 
>>> The file is of the format (mate1)
>>> @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
>>> CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
>>> +
>>> BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
>>> 
>>> && (mate2)
>>> 
>>> @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
>>> TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
>>> +
>>> ##################################################
>>> 
>>> 
>>> My idea is to separate using a regex such that / 1:/ would be the first
>>> mate pair and / 2:/ would go in the second mate file.
>>> I implemented the code below but each output file is empty. Can someone
>>> spot my error?
>>> 
>>> Thanks,
>>> Sean.
>>> 
>>> my $infile   = shift;
>>> my $outfile1 = $infile."_1";
>>> my $outfile2 = $infile."_2";
>>> 
>>> my $seqin = Bio::SeqIO->new(
>>>                            -file   => "<$infile",
>>>                            -format => "fastq",
>>>                            );
>>> my $seqout1 = Bio::SeqIO->new(
>>>                             -file   => ">$outfile1",
>>>                             -format => "fastq",
>>>                             );
>>> 
>>> my $seqout2 = Bio::SeqIO->new(
>>>                             -file   => ">$outfile2",
>>>                             -format => "fastq",
>>>                             );
>>> while (my $inseq = $seqin->next_seq) {
>>>   if ($seqin->desc =~ / 1:/){
>> Hi Sean
>> 
>> You're calling the desc method on the stream ($seqin), not on the seq object ($inseq).
>> 
>> Cheers
>> 
>> Mike
>> 
>>>     $seqout1->write_seq($inseq);
>>>   } else {
>>>     $seqout2->write_seq($inseq);
>>>   }
>>> }
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>> Michael Muratet, Ph.D.
>> Senior Scientist
>> HudsonAlpha Institute for Biotechnology
>> mmuratet at hudsonalpha.org
>> (256) 327-0473 (p)
>> (256) 327-0966 (f)
>> 
>> Room 4005
>> 601 Genome Way
>> Huntsville, Alabama 35806
>> 
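
Putting the two replies above together, a corrected version of the loop might look like the sketch below. It assumes the newer Illumina header style shown in this thread; the helper names is_first_mate and split_pairs are made up for illustration, and Bio::SeqIO is loaded inside split_pairs only so that the helper can be tried without BioPerl installed.

```perl
use strict;
use warnings;

# True when a description such as "1:N:0:ATCACG" belongs to the first mate.
# The pattern is anchored and has no leading space, because desc() holds
# only the text after the first whitespace in the header.
sub is_first_mate {
    my ($desc) = @_;
    return defined $desc && $desc =~ /^1:/ ? 1 : 0;
}

# Split an interleaved paired-end FASTQ file into "<infile>_1" and "<infile>_2".
sub split_pairs {
    my ($infile) = @_;
    require Bio::SeqIO;    # loaded lazily; see the note above

    my $seqin   = Bio::SeqIO->new(-file => "<$infile",     -format => 'fastq');
    my $seqout1 = Bio::SeqIO->new(-file => ">${infile}_1", -format => 'fastq');
    my $seqout2 = Bio::SeqIO->new(-file => ">${infile}_2", -format => 'fastq');

    while (my $inseq = $seqin->next_seq) {
        # desc() is called on the sequence object, not on the input stream
        my $out = is_first_mate($inseq->desc) ? $seqout1 : $seqout2;
        $out->write_seq($inseq);
    }
    return;
}

# Run only when a file name is given on the command line.
split_pairs($ARGV[0]) if @ARGV;
```

As in the original script, anything that is not recognised as mate 1 goes to the second file, so input using the older header convention would end up entirely in "<infile>_2".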