[Bioperl-l] counting gaps in sequence data

Fri Oct 15 14:50:38 EDT 2004

Mike-

$gap is the current gap length that you are iterating through with the 
second for loop.  Call it 3 for now.  $gaptype{3} then represents the 
hash value for the gap of length 3.  We are storing the location of the 
current gap (of length 3) hit in that hash value (say for example 
$gaptype{3} = 14 where 14 is the location), however there can be more 
than one gap of length 3, so we want to do something like $gaptype{3} = 
14 and 21 and 65.  In other words we want the value of  $gaptype{3} to 
contain several values - an array.  Remember that hash values must be 
scalars, but we can reference an array with a scalar and set the hash 
value equal to that array reference.   You could put the location values 
into an array, @locations =  (14, 21, 65), and then set place a 
reference to that array into the hash value like this $gaptype{3} = 
\@locaitons, or you could just put an anonymous, unnamed array directly 
into the hash value like this $gaptype{3} = [14, 21, 65].  The square  
brackets  make that an anonymous array.  Since we don't have all these 
values at once, you do the same thing with push.  The terrible looking 
@{$gaptype{$gap}} thing dereferences an anonymous array pointed to by 
the hash value $gaptype{$gap}.  Push just sees the array behind that 
reference and pushes $-[0] + 2 onto that array.  $-[0] is the first 
value in the @- array (pg. 667 in your Programming Perl Book).  That 
array stores the begining of that last hit found by the last match (2 is 
added because $-[0] counts the first position of the string to be 
matched as 0 and also returns the position just before the match). A 
regex something like this should work /(\A|[atgc])-{$gap}([atgc]|\z)/g.  
Although I noticed that it messes up the location of the 3' gap calling 
it 2 instead of 1.  Apparently $-[0] returns 0 for both a first and 
second position match - go figure.

Barry

Michael S. Robeson II wrote:

> Wow, that seems to work pretty well. However, I am unsure of what the 
> following line means:
>
> push @{$gaptype{$gap}}, $-[0] + 2;
>
> Especially the  " $-[0] + 2" part of it. I understand that it is an 
> array but what is going on there is a little vague. Other than that I 
> pretty much understand the code. Also, about the part not being able 
> to match gaps at the end of a string will be a problem. I am currently 
> working off of what you've posted and seeing if I can fit (using a 
> character class I suppose) a "\Z", "\z", or "$" to match any gaps at 
> the end of a line.
>
> -Cheers!
> -Mike
>
>
> On Oct 14, 2004, at 17:32, Barry Moore wrote:
>
>> Mike-
>>
>> Something like this maybe?
>>
>> use strict;
>> use warnings;
>>
>> my %seqs = (human => "acgtt---cgatacg---acgact-----t",
>>            chimp => "acgtacgatac---actgca---ac",
>>            mouse => "acgata---acgatcg----acgt");
>>
>> for my $seq (keys %seqs) { # An array of your sequences
>>  print "\n\nThe $seq sequence has the following gaps:\n";
>>  my %gaptype;
>>  for my $gap (1..5) { # 5 or however large you want gaps to be counted
>>    while ($seqs{$seq} =~ /[atgc]-{$gap}[atgc]/g) { #notice that this 
>> won't catch terminal gaps
>>      #This creates a hash of arrays.  The arrays hold the locations 
>> of the
>>      #gaps, and the count of each gaptype is determined by the length 
>> of that array.
>>      push @{$gaptype{$gap}}, $-[0] + 2;
>>    }
>>    if (defined @{$gaptype{$gap}}) {
>>      my $positions = join ", ", @{$gaptype{$gap}};
>>      print "\tGap length $gap begining at positions:\t$positions\n";
>>    }
>>  }
>> }
>>
>> Barry Moore
>>
>>
>> Michael Robeson wrote:
>
>

-- 
Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT