[Bioperl-l] alignable portion of a genome

Wed May 13 13:47:58 EDT 2009

Just to be clear, my suggestion was to realign the genome-derived reads
without any additional errors introduced.  By not adding any other errors,
you'll still get to find out which regions of the genome are unique even
while accepting 2 mutations (or however you setup your alignment filtering).

Russell's hash method also won't work well when the number of mutations
increases to, say, 10 out of 100bp reads (which is what I'm dealing with
right now, as a matter of fact ...), and certainly won't take into account
the ability of using paired end reads to render an ambiguous alignment
unique based on appropriate location relative to a uniquely aligned mate
pair.

So for all these reasons, I still argue that the best way to accurately and
meaningfully measure the alignable portion of a genome is to simply align
the genome (without errors) back to itself using your alignment
method/filters of choice.  If you're doing paired-end reads, you'll still
have to simulate that aspect (don't just take reads spaced exactly by 200 bp
-- instead, you should use the emperical distribution of read distances seen
in your data, and repeat the simulation-alignment a few times to see how
stable your final estimates are).

-Aaron

On Tue, May 12, 2009 at 7:34 PM, Joel Martin <j_martin at lbl.gov> wrote:

> Hello,
>   Doing this with hashes ends up being a little inefficient for
> larger kmers like 35 in larger genomes.  A suffix array tool
> like 'tallymer' will tell you the unique/non-unique kmer counts
> quickly, and as Aaron suggested generating fake reads based
> on a reference and introduce errors into them so you can evaluate
> how well they map back is a good strategy.  maq has a command to do
> that built in.
>
> Joel
>
> On Wed, May 13, 2009 at 09:27:57AM +1200, Smithies, Russell wrote:
> > Adding the mutations is a little hacky (and probably slow) but I think it
> works correctly.
> > The stats should work out OK but it's too early and I haven't had a
> coffee yet so can't be sure :-)
> >
> > --Russell
> >
> > ============================
> > #!perl -w
> >
> > my $seq = "atcgacgatcgaacgatcga";
> > my $debug = 0;
> >
> >
> > foreach ($seq =~ /(?=(\w{5}))/g){
> >       $h++;
> >       # add all the exact words to the hash
> >       $hash{$_}++;
> >       print "$_\n" if $debug;
> >       # mutate words and add to hash
> >       my at rr = mutate($_);
> >       foreach (@rr){
> >               print "$_\n" if $debug;
> >               $h++;
> >               $hash{$_}++;
> >       }
> > }
> >
> >
> > # print out the hash counts & stats
> > foreach (keys %hash){
> >       print "$_\t$hash{$_}\n" if $debug;
> >       $singles++ if($hash{$_} eq 1);
> > }
> > print $singles/$h,"\n";
> >
> >
> > sub mutate{
> >       my @array = split '',shift;
> >       my @res = ();
> >       my $rep = 'X';
> >       for(my$i = 0; $i <= $#array; $i++){
> >               my $old1 = $array[$i];
> >               splice @array, $i, 1, $rep;
> >               push @res, (join '', @array);
> >               for(my$j = $i+1; $j <= $#array; $j++){
> >                       my $old2 = $array[$j];
> >                       splice @array, $j, 1, $rep;
> >                       push @res, (join '', @array);
> >                       splice @array, $j, 1, $old2;
> >               }
> >               splice @array, $i, 1, $old1;
> >       }
> >       return @res;
> > }
> >
> > ================================
> >
> > > -----Original Message-----
> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> > > Sent: Tuesday, 12 May 2009 3:56 p.m.
> > > To: 'fadista'; 'Bioperl-l at lists.open-bio.org'
> > > Subject: Re: [Bioperl-l] alignable portion of a genome
> > >
> > > Perfect matches is easy:
> > >
> > >     $seq = "atcgacgatcgaacgatcga";
> > >
> > >     foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> > >     foreach (keys %hash){ $singles++ if($hash{$_} eq 1)}
> > >     print $singles/$h;
> > >
> > > Could probably be done with map as well.
> > > Counting the miss-matches might take a bit more thinking....
> > > Any ideas MAJ?
> > >
> > > --Russell
> > >
> > >
> > > > -----Original Message-----
> > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > > > bounces at lists.open-bio.org] On Behalf Of fadista
> > > > Sent: Monday, 11 May 2009 9:32 p.m.
> > > > To: Bioperl-l at lists.open-bio.org
> > > > Subject: [Bioperl-l] alignable portion of a genome
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I would like to know of a good and fast way that could help me
> calculate the
> > > > alignable portion of a genome (not human), given a reference
> sequence.
> > > > When I say alignable portion I mean that I want to know all the
> positions of
> > > > the genome that can be covered uniquely by reads of 36 bp and up to 2
> > > > mismatches.
> > > >
> > > > Some have advised me to work with Perl using the following strategy
> but I am
> > > > not a Perl user so if someone has already a script for this function,
> it
> > > > would be nice:
> > > >
> > > > "you could approach it by walking along the genome in a sliding
> window of
> > > > 36 nt, and hash the frequency of each 36 nt sequence that you
> encounter.
> > > > Then count how many of the 36 nt sequences had a frequency of exactly
> > > > one. Divide this by the total number of 36nt windows visited. This
> > > > should be do-able in about 20 lines of Perl."
> > > >
> > > >
> > > > Best regards and thanks in advance
> > > >
> > > > --
> > > > View this message in context:
> http://www.nabble.com/alignable-portion-of-a-
> > > > genome-tp23480025p23480025.html
> > > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > > =======================================================================
> > > Attention: The information contained in this message and/or attachments
> > > from AgResearch Limited is intended only for the persons or entities
> > > to which it is addressed and may contain confidential and/or privileged
> > > material. Any review, retransmission, dissemination or other use of, or
> > > taking of any action in reliance upon, this information by persons or
> > > entities other than the intended recipients is prohibited by AgResearch
> > > Limited. If you have received this message in error, please notify the
> > > sender immediately.
> > > =======================================================================
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>