[Bioperl-l] randomizing fastq sequences

Tue Feb 8 09:08:37 UTC 2011

If memory is an issue, you could write just the sequence IDs to a file 
(one per line) and shuffle those, using List::Util as Simon 
demonstrated. In a second pass you would then replace each ID with its 
complete fastq entry, which can be done without reading the entire file 
into memory (it might be a bit slow, but that probably doesn't matter).
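
A minimal, untested-on-real-data sketch of that two-pass idea follows. It 
is a variation rather than exactly the scheme described above: instead of 
sequence IDs it remembers the byte offset at which each record starts, so 
only one number per record is held in memory. The sub name shuffle_fastq 
is made up for illustration, and the code assumes every record is exactly 
four lines (no wrapped sequence or quality lines).

```perl
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'shuffle';

# Hypothetical helper: shuffle the records of a fastq file using byte offsets.
# Assumption: every record is exactly four lines long.
sub shuffle_fastq {
    my ($in_file, $out_file) = @_;

    open(my $in, '<', $in_file) or die "Cannot read $in_file: $!";

    # Pass 1: remember where each record starts; only offsets stay in memory.
    my @offsets;
    until (eof($in)) {
        push @offsets, tell($in);
        for (1 .. 4) {
            my $line = <$in>;
            last unless defined $line;
        }
    }

    open(my $out, '>', $out_file) or die "Cannot write $out_file: $!";

    # Pass 2: visit the offsets in random order and copy each record across.
    foreach my $offset (shuffle(@offsets)) {
        seek($in, $offset, 0) or die "Cannot seek in $in_file: $!";
        for (1 .. 4) {
            my $line = <$in>;
            last unless defined $line;
            # The last line of the input may lack a trailing newline.
            $line .= "\n" unless $line =~ /\n\z/;
            print {$out} $line;
        }
    }

    close($out) or die "Cannot close $out_file: $!";
    close($in);

    # Number of records written.
    return scalar @offsets;
}
```

Calling shuffle_fastq('your_input.fastq', 'your_output.fastq') would write 
the shuffled records to the second file. The random seeks in the second 
pass are what make this slower than shuffling everything in memory.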
Frank

simon andrews (BI) wrote:
> On 7 Feb 2011, at 22:07, shalu sharma wrote:
>
>   
>> Hi,
>>   I am trying to test a program, for which I need to change the order of
>> the sequences in a fastq file.
>> My fastq file contains about 50,000 sequences.
>> Is there a script that can do this task?
>>     
>
> Since FastQ is supported in SeqIO you could do something like (untested):
>
> #!/usr/bin/perl
> use warnings;
> use strict;
> use List::Util 'shuffle';
> use Bio::SeqIO;
>
> my @seqs;
>
> my $in = Bio::SeqIO->new(-file => 'your_input.fastq',
> 			 -format => 'Fastq');
>
> while (my $seq = $in -> next_seq()) {
>     push @seqs,$seq;
> }
>
> @seqs = shuffle(@seqs);
>
> my $out = Bio::SeqIO->new(-file => '>your_output.fastq',
> 			  -format => 'Fastq');
>
> foreach my $seq (@seqs) {
>     $out->write_seq($seq);
> }
>
> ## End
>
> This has the disadvantage that it will hold all of the sequences in memory whilst shuffling, but I don't think there's an easy way around that.
>
> Simon.
