[Bioperl-l] alignable portion of a genome
Joel Martin
j_martin at lbl.gov
Tue May 12 23:34:08 UTC 2009
Hello,
Doing this with hashes ends up being a little inefficient for
larger kmers like 35 in larger genomes. A suffix array tool
like 'tallymer' will tell you the unique/non-unique kmer counts
quickly, and as Aaron suggested generating fake reads based
on a reference and introduce errors into them so you can evaluate
how well they map back is a good strategy. maq has a command to do
that built in.
Joel
On Wed, May 13, 2009 at 09:27:57AM +1200, Smithies, Russell wrote:
> Adding the mutations is a little hacky (and probably slow) but I think it works correctly.
> The stats should work out OK but it's too early and I haven't had a coffee yet so can't be sure :-)
>
> --Russell
>
> ============================
> #!perl -w
>
> my $seq = "atcgacgatcgaacgatcga";
> my $debug = 0;
>
>
> foreach ($seq =~ /(?=(\w{5}))/g){
> $h++;
> # add all the exact words to the hash
> $hash{$_}++;
> print "$_\n" if $debug;
> # mutate words and add to hash
> my at rr = mutate($_);
> foreach (@rr){
> print "$_\n" if $debug;
> $h++;
> $hash{$_}++;
> }
> }
>
>
> # print out the hash counts & stats
> foreach (keys %hash){
> print "$_\t$hash{$_}\n" if $debug;
> $singles++ if($hash{$_} eq 1);
> }
> print $singles/$h,"\n";
>
>
> sub mutate{
> my @array = split '',shift;
> my @res = ();
> my $rep = 'X';
> for(my$i = 0; $i <= $#array; $i++){
> my $old1 = $array[$i];
> splice @array, $i, 1, $rep;
> push @res, (join '', @array);
> for(my$j = $i+1; $j <= $#array; $j++){
> my $old2 = $array[$j];
> splice @array, $j, 1, $rep;
> push @res, (join '', @array);
> splice @array, $j, 1, $old2;
> }
> splice @array, $i, 1, $old1;
> }
> return @res;
> }
>
> ================================
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> > Sent: Tuesday, 12 May 2009 3:56 p.m.
> > To: 'fadista'; 'Bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] alignable portion of a genome
> >
> > Perfect matches is easy:
> >
> > $seq = "atcgacgatcgaacgatcga";
> >
> > foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> > foreach (keys %hash){ $singles++ if($hash{$_} eq 1)}
> > print $singles/$h;
> >
> > Could probably be done with map as well.
> > Counting the miss-matches might take a bit more thinking....
> > Any ideas MAJ?
> >
> > --Russell
> >
> >
> > > -----Original Message-----
> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > > bounces at lists.open-bio.org] On Behalf Of fadista
> > > Sent: Monday, 11 May 2009 9:32 p.m.
> > > To: Bioperl-l at lists.open-bio.org
> > > Subject: [Bioperl-l] alignable portion of a genome
> > >
> > >
> > > Hi,
> > >
> > > I would like to know of a good and fast way that could help me calculate the
> > > alignable portion of a genome (not human), given a reference sequence.
> > > When I say alignable portion I mean that I want to know all the positions of
> > > the genome that can be covered uniquely by reads of 36 bp and up to 2
> > > mismatches.
> > >
> > > Some have advised me to work with Perl using the following strategy but I am
> > > not a Perl user so if someone has already a script for this function, it
> > > would be nice:
> > >
> > > "you could approach it by walking along the genome in a sliding window of
> > > 36 nt, and hash the frequency of each 36 nt sequence that you encounter.
> > > Then count how many of the 36 nt sequences had a frequency of exactly
> > > one. Divide this by the total number of 36nt windows visited. This
> > > should be do-able in about 20 lines of Perl."
> > >
> > >
> > > Best regards and thanks in advance
> > >
> > > --
> > > View this message in context: http://www.nabble.com/alignable-portion-of-a-
> > > genome-tp23480025p23480025.html
> > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > =======================================================================
> > Attention: The information contained in this message and/or attachments
> > from AgResearch Limited is intended only for the persons or entities
> > to which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipients is prohibited by AgResearch
> > Limited. If you have received this message in error, please notify the
> > sender immediately.
> > =======================================================================
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list