[Bioperl-l] counting gaps in sequence data

Thu Oct 14 18:05:57 EDT 2004

I have a set of data that looks something like the following:

 >human
acgtt---cgatacg---acgact-----t
 >chimp
acgtacgatac---actgca---ac
 >mouse
acgata---acgatcg----acgt

I am having trouble setting up a hash to count the number and lengths 
of contiguous runs of gaps. For example, the 'human' sequence above has 
2 runs of 3 gaps and 1 run of 5 gaps, the 'chimp' has 2 runs of 3 gaps, 
and the 'mouse' has 1 run of 3 gaps and 1 run of 4 gaps.

So, I am having trouble assigning a dynamic variable (i.e. the gap 
length) and placing it in a pattern match so that I can count how many 
gaps of that length are in a particular sequence. I know how to set up 
a hash to count the number of times a gap appears: 
'$gaptype{$gap}++' or something similar. The problem is: what is the 
best way to set '$gap' dynamically, and how do I do it?
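
One possible approach, offered only as a minimal sketch: match each run of '-' with a global regex and use the length of the matched run as the hash key, so no separate pattern per gap length is needed. The sub name count_gap_lengths is hypothetical; %gaptype follows the naming used in this message.

```perl
use strict;
use warnings;

# Tally runs of '-' by length; the run length itself becomes the hash key.
# (count_gap_lengths is a hypothetical helper name, not part of BioPerl.)
sub count_gap_lengths {
    my ($seq) = @_;
    my %gaptype;
    while ($seq =~ /(-+)/g) {       # each iteration finds the next run of gaps
        $gaptype{ length $1 }++;    # "dynamic" $gap: the length of this run
    }
    return \%gaptype;
}

my $counts = count_gap_lengths('acgtt---cgatacg---acgact-----t');
print "$_ => $counts->{$_}\n" for sort { $a <=> $b } keys %$counts;
```

For the 'human' sequence this should report two runs of length 3 and one run of length 5.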

I need to know the length of each consecutive string of gaps. I know 
how to count the gaps by using the 'tr' operator. But it gets confusing 
when I need to keep a separate count for each gap length. I also 
need to know the position of each gap (denoted by the position of the 
first gap in that particular instance). I know that I can use the 
'pos()' function for this.
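
A hedged sketch of how 'pos()' might be used here: after a successful global match, pos() gives the 0-based offset just past the match, so subtracting the run length and adding 1 yields the 1-based position of the first gap. The sub name gap_runs is hypothetical.

```perl
use strict;
use warnings;

# Return a list of [start, length] pairs, one per run of '-'.
# Start positions are 1-based. (gap_runs is a hypothetical helper name.)
sub gap_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /(-+)/g) {
        my $len   = length $1;
        my $start = pos($seq) - $len + 1;   # pos() is the offset just past the match
        push @runs, [ $start, $len ];
    }
    return @runs;
}

printf "length %d at position %d\n", $_->[1], $_->[0]
    for gap_runs('acgtt---cgatacg---acgact-----t');
```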

So, my problem is that I think I know some of the bits of code to put 
into place, but I am getting lost on how to structure it all 
together. For now I am just trying to get my output to look like this:

Human
number of 3 base pair gaps:		2
			at positions:		6, 16
number of 5 base pair gaps:		1
			at positions:		25

Chimp
.... and so on ...
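
A report in this layout could be built with a hash of arrays keyed by gap length, where each array holds the 1-based start positions of the runs of that length. This is only a sketch under assumptions: the sub name gap_report is hypothetical, the sequences are hard-coded from the sample data rather than read from a file, and the tab spacing is a guess at the layout shown.

```perl
use strict;
use warnings;

# Build the report text for one sequence.
# %starts_by_len maps gap length => array of 1-based start positions.
# (gap_report is a hypothetical helper name.)
sub gap_report {
    my ($name, $seq) = @_;
    my %starts_by_len;
    while ($seq =~ /(-+)/g) {
        my $len = length $1;
        push @{ $starts_by_len{$len} }, pos($seq) - $len + 1;
    }
    my $out = ucfirst($name) . "\n";
    for my $len (sort { $a <=> $b } keys %starts_by_len) {
        my @starts = @{ $starts_by_len{$len} };
        $out .= sprintf "number of %d base pair gaps:\t\t%d\n", $len, scalar @starts;
        $out .= "\t\t\tat positions:\t\t" . join(', ', @starts) . "\n";
    }
    return $out;
}

# Sample records from this message, kept in an array to preserve their order.
my @records = (
    [ human => 'acgtt---cgatacg---acgact-----t' ],
    [ chimp => 'acgtacgatac---actgca---ac' ],
    [ mouse => 'acgata---acgatcg----acgt' ],
);
print gap_report(@$_), "\n" for @records;
```

The count for each length is simply the size of its position array, so the counts and positions never need to be tracked separately.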

So, any suggestions, on all or even just bits of this, would be greatly 
appreciated. This should help me get started on some more advanced 
parsing I need to do after this. I like to try to figure things out on 
my own if I can, so even pseudocode would be of great help!

-Thanks
-Mike