[Bioperl-l] Select random sequences from a fasta file

shalabh sharma shalabh.sharma7 at gmail.com
Wed Mar 21 20:38:02 UTC 2012


Thanks a lot, Jason.
I will look into this, and will also try the Celera Assembler approach.

Thanks
Shalabh


On Wed, Mar 21, 2012 at 4:07 PM, Jason Stajich <jason.stajich at gmail.com> wrote:

> Hi -
>
> If they are short reads that are all the same length (e.g. one line per
> sequence), you can do this in plain Perl with seek and an RNG: jump to a
> random position and read 2 lines (header and sequence) from the file.
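
For reference, the seek-and-RNG idea might look roughly like the sketch
below. It is written in Python rather than Perl and is not Jason's code. The
function name sample_fasta_by_seek and its parameters are made up for
illustration. It assumes every record is exactly two lines (header, then
sequence) and that records are about the same byte length, so that landing
on a random byte picks each record with roughly equal probability.

```python
import os
import random


def sample_fasta_by_seek(path, n, seed=None):
    """Pick n distinct 2-line FASTA records by jumping to random byte offsets.

    Assumes the file has at least n records, each one header line plus one
    sequence line, all of similar byte length.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    picked = {}  # byte offset of header -> (header, sequence)
    with open(path, "rb") as fh:
        while len(picked) < n:
            fh.seek(rng.randrange(size))
            fh.readline()  # discard the (probably partial) line we landed in
            # Walk forward to the next header line, wrapping at end of file.
            while True:
                start = fh.tell()
                line = fh.readline()
                if not line:
                    fh.seek(0)
                    continue
                if line.startswith(b">"):
                    break
            seq = fh.readline()
            # Keying on the offset drops repeats of an already chosen record.
            picked.setdefault(
                start, (line.decode().rstrip(), seq.decode().rstrip())
            )
    return list(picked.values())
```

Something like sample_fasta_by_seek("reads.fa", 5000) would then return a
list of (header, sequence) pairs without reading the whole file or building
an index; "reads.fa" is a placeholder file name.
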
> The problem with trying to do this in BioPerl is that indexing the
> multi-FASTA file ends up being really slow once you get past ~4-5M IDs in
> the hash structure that is used. Plus, there isn't a nice way to do the
> random selection other than to generate the full list of IDs, shuffle it,
> and pop off a few thousand to look up. I think this is overkill for the
> problem you are trying to solve.
>
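
To make the index-then-pick route concrete, here is a rough sketch in plain
Python. It does not use BioPerl's indexing modules, and the function name
sample_fasta_by_index is made up for illustration. It makes one full pass to
record where every record starts, picks n of those positions at random, then
seeks to each one. The offsets list holds one entry per record, which is the
part that gets heavy at millions of reads.

```python
import random


def sample_fasta_by_index(path, n, seed=None):
    """Index every record's byte offset, then fetch n randomly chosen records."""
    offsets = []  # one entry per record; grows with the number of reads
    with open(path, "rb") as fh:
        # Pass 1: remember the byte offset of every header line.
        while True:
            pos = fh.tell()
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                offsets.append(pos)
        # Choose n distinct records; raises ValueError if n > record count.
        chosen = random.Random(seed).sample(offsets, n)
        records = []
        for pos in sorted(chosen):  # sorted so the file is read front to back
            fh.seek(pos)
            header = fh.readline().decode().rstrip()
            seq_lines = []
            # Collect sequence lines until the next header or end of file.
            while True:
                line = fh.readline()
                if not line or line.startswith(b">"):
                    break
                seq_lines.append(line.decode().strip())
            records.append((header, "".join(seq_lines)))
    return records
```

Unlike a seek-only approach, this copes with multi-line sequences and gives
an exactly uniform sample, at the cost of reading the whole file once.
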
> There is a nice utility for this in the Celera Assembler: once you build a
> store with the gatekeeper tool, there is an option to dump a random
> subselection of the data.
>
> Jason
> On Mar 21, 2012, at 12:42 PM, shalabh sharma wrote:
>
> Hi All,
>          Is there a way to select random sequences from a multi-FASTA
> file? I am currently using a method that is not very sophisticated.
> Is there a module in BioPerl that can do that?
>
> I have a FASTA file containing around 10 million reads, and I want to get
> a few thousand sequences out of it (randomly selected).
>
> Thanks
> Shalabh
>
> --
> Shalabh Sharma
> Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences
> University of Georgia
> Athens, GA 30602-3636
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>




