[Bioperl-l] alignable portion of a genome

Tue May 12 23:34:08 UTC 2009

Hello,
   Doing this with hashes ends up being a little inefficient for
larger kmers like 35 in larger genomes.  A suffix array tool
like 'tallymer' will tell you the unique/non-unique kmer counts
quickly, and as Aaron suggested generating fake reads based
on a reference and introduce errors into them so you can evaluate
how well they map back is a good strategy.  maq has a command to do
that built in.

Joel

On Wed, May 13, 2009 at 09:27:57AM +1200, Smithies, Russell wrote:
> Adding the mutations is a little hacky (and probably slow) but I think it works correctly.
> The stats should work out OK but it's too early and I haven't had a coffee yet so can't be sure :-)
> 
> --Russell
> 
> ============================
> #!perl -w
> 
> my $seq = "atcgacgatcgaacgatcga";
> my $debug = 0;
> 
> 
> foreach ($seq =~ /(?=(\w{5}))/g){
> 	$h++; 
> 	# add all the exact words to the hash
> 	$hash{$_}++;
> 	print "$_\n" if $debug;
> 	# mutate words and add to hash
> 	my at rr = mutate($_);
> 	foreach (@rr){
> 		print "$_\n" if $debug;
> 		$h++;
> 	 	$hash{$_}++;
> 	}
> }
> 
> 
> # print out the hash counts & stats
> foreach (keys %hash){
> 	print "$_\t$hash{$_}\n" if $debug;
> 	$singles++ if($hash{$_} eq 1);
> }
> print $singles/$h,"\n";
> 
> 
> sub mutate{
> 	my @array = split '',shift;
> 	my @res = ();
> 	my $rep = 'X';
> 	for(my$i = 0; $i <= $#array; $i++){
> 		my $old1 = $array[$i];
> 		splice @array, $i, 1, $rep;
> 		push @res, (join '', @array);
> 		for(my$j = $i+1; $j <= $#array; $j++){
> 			my $old2 = $array[$j];
> 			splice @array, $j, 1, $rep;
> 			push @res, (join '', @array);
> 			splice @array, $j, 1, $old2;
> 		}
> 		splice @array, $i, 1, $old1;
> 	}
> 	return @res;
> }
> 
> ================================
> 
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> > Sent: Tuesday, 12 May 2009 3:56 p.m.
> > To: 'fadista'; 'Bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] alignable portion of a genome
> > 
> > Perfect matches is easy:
> > 
> > 	$seq = "atcgacgatcgaacgatcga";
> > 
> > 	foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> > 	foreach (keys %hash){ $singles++ if($hash{$_} eq 1)}
> > 	print $singles/$h;
> > 
> > Could probably be done with map as well.
> > Counting the miss-matches might take a bit more thinking....
> > Any ideas MAJ?
> > 
> > --Russell
> > 
> > 
> > > -----Original Message-----
> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > > bounces at lists.open-bio.org] On Behalf Of fadista
> > > Sent: Monday, 11 May 2009 9:32 p.m.
> > > To: Bioperl-l at lists.open-bio.org
> > > Subject: [Bioperl-l] alignable portion of a genome
> > >
> > >
> > > Hi,
> > >
> > > I would like to know of a good and fast way that could help me calculate the
> > > alignable portion of a genome (not human), given a reference sequence.
> > > When I say alignable portion I mean that I want to know all the positions of
> > > the genome that can be covered uniquely by reads of 36 bp and up to 2
> > > mismatches.
> > >
> > > Some have advised me to work with Perl using the following strategy but I am
> > > not a Perl user so if someone has already a script for this function, it
> > > would be nice:
> > >
> > > "you could approach it by walking along the genome in a sliding window of
> > > 36 nt, and hash the frequency of each 36 nt sequence that you encounter.
> > > Then count how many of the 36 nt sequences had a frequency of exactly
> > > one. Divide this by the total number of 36nt windows visited. This
> > > should be do-able in about 20 lines of Perl."
> > >
> > >
> > > Best regards and thanks in advance
> > >
> > > --
> > > View this message in context: http://www.nabble.com/alignable-portion-of-a-
> > > genome-tp23480025p23480025.html
> > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > =======================================================================
> > Attention: The information contained in this message and/or attachments
> > from AgResearch Limited is intended only for the persons or entities
> > to which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipients is prohibited by AgResearch
> > Limited. If you have received this message in error, please notify the
> > sender immediately.
> > =======================================================================
> > 
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l