[Bioperl-l] randomizing fastq sequences
Frank Schwach
fs5 at sanger.ac.uk
Tue Feb 8 06:57:12 EST 2011
Nice one - but if I understand it correctly, it relies on there being
exactly 4 lines for each record. This is probably the case but it would
be a good idea to double-check the fastq file in question, just to make
sure.
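
A quick way to do that check could look something like the sketch below. It is untested against real data and assumes the simple layout of exactly four lines per record (header starting with "@", sequence, "+" separator, quality string of the same length as the sequence). The subroutine name check_four_line_fastq is only made up for this illustration; it is not part of BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of problems found in a FASTQ file, assuming exactly four
# lines per record (no wrapped sequence or quality lines).
# The name check_four_line_fastq is hypothetical, not a BioPerl function.
sub check_four_line_fastq {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";

    my @problems;
    my $n       = 0;    # number of lines read so far
    my $seq_len = 0;    # length of the current record's sequence line

    while (my $line = <$fh>) {
        $n++;
        $line =~ s/\r?\n\z//;       # strip Unix or Windows line endings
        my $pos = ($n - 1) % 4;     # 0 header, 1 sequence, 2 separator, 3 quality

        if ($pos == 0) {
            push @problems, sprintf('line %d: header does not start with @', $n)
                unless $line =~ /^\@/;
        }
        elsif ($pos == 1) {
            $seq_len = length $line;
        }
        elsif ($pos == 2) {
            push @problems, sprintf('line %d: separator does not start with +', $n)
                unless $line =~ /^\+/;
        }
        else {
            push @problems, sprintf('line %d: quality and sequence lengths differ', $n)
                unless length($line) == $seq_len;
        }
    }
    close $fh;

    push @problems, sprintf('file has %d lines, not a multiple of 4', $n)
        if $n % 4;

    return @problems;
}

# Usage: perl check_fastq.pl reads.fastq
if (@ARGV) {
    my @problems = check_four_line_fastq($ARGV[0]);
    print "$_\n" for @problems;
    print "Looks like plain 4-line FASTQ\n" unless @problems;
}
```

If this reports nothing, the line-based shuffle should be safe; if it complains, the file probably has wrapped records and would need a proper parser such as Bio::SeqIO.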
Frank
Roy Chaudhuri wrote:
> TMTOWTDI, maybe also use the Tie::File module?
>
> Something like:
>
> #!/usr/bin/perl
> use warnings FATAL=>qw(all);
> use Modern::Perl;
> use Tie::File;
> use Fcntl qw(O_RDONLY);
> use List::Util qw(shuffle);
> my @fastq;
> tie @fastq, 'Tie::File', $ARGV[0], mode=>O_RDONLY or die $!;
> say join "\n", @fastq[4*$_..4*$_+3] for shuffle 0..$#fastq/4;
>
> Cheers,
> Roy.
>
> On 08/02/2011 09:08, Frank Schwach wrote:
>> If memory is an issue then I guess you could create a file of just the
>> sequence IDs (one per line), then shuffle those (using List::Util like
>> Simon demonstrated). In the end you would substitute the IDs for the
>> whole fastq entry again, which you can do without reading an entire file
>> into memory (might be a bit slow, but that probably doesn't matter).
>> Frank
>>
>>
>> simon andrews (BI) wrote:
>>> On 7 Feb 2011, at 22:07, shalu sharma wrote:
>>>
>>>
>>>> Hi,
>>>> I am trying to test a program, for which I need to change the order of
>>>> sequences in a fastq file.
>>>> My fastq file contains about 50,000 sequences.
>>>> Is there any script that can do this task?
>>>>
>>>
>>> Since FastQ is supported in SeqIO you could do something like
>>> (untested):
>>>
>>> #!/usr/bin/perl
>>> use warnings;
>>> use strict;
>>> use List::Util 'shuffle';
>>> use Bio::SeqIO;
>>>
>>> my @seqs;
>>>
>>> my $in = Bio::SeqIO->new(-file   => 'your_input.fastq',
>>>                          -format => 'Fastq');
>>>
>>> while (my $seq = $in->next_seq()) {
>>>     push @seqs, $seq;
>>> }
>>>
>>> @seqs = shuffle(@seqs);
>>>
>>> my $out = Bio::SeqIO->new(-file   => '>your_output.fastq',
>>>                           -format => 'Fastq');
>>>
>>> foreach my $seq (@seqs) {
>>>     $out->write_seq($seq);
>>> }
>>>
>>> ## End
>>>
>>> This has the disadvantage that it will hold all of the sequences in
>>> memory whilst shuffling, but I don't think there's an easy way
>>> around that.
>>>
>>> Simon.
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.