[Bioperl-l] Finding locations of a string within a fasta file

Mon Jul 17 02:21:31 UTC 2006

> I'm trying to determine where (the start .. end positions) within a  
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> Suggestions on how to easily retrieve this would be appreciated.

First you need to get the sequence into a string within Perl. Since your 
email subject says it is in a FASTA file, you need to:

1. open the fasta file - see Bio::SeqIO
2. read the first sequence (as an object) - see next_seq()
3. get the string of the sequence in the object - see seq()
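
Those three steps might look roughly like the sketch below. It assumes 
BioPerl is installed and that the file really is in FASTA format; 
first_seq_string() is just a made-up helper name for illustration, not 
something BioPerl provides.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Return the first sequence in a FASTA file as a plain Perl string.
# first_seq_string() is a hypothetical helper name, not part of BioPerl.
sub first_seq_string {
    my ($path) = @_;

    # Step 1: open the FASTA file.
    my $in = Bio::SeqIO->new(-file => $path, -format => 'fasta');

    # Step 2: read the first sequence as an object.
    my $seq = $in->next_seq
        or die "No sequence found in $path\n";

    # Step 3: get the sequence string out of the object.
    return $seq->seq;
}

# Usage: perl script.pl scaffold.fa
if (@ARGV) {
    my $dna = first_seq_string($ARGV[0]);
    print length($dna), " bases read\n";
}
```

If the file holds several scaffolds, the same next_seq() call can be put 
in a while loop to visit each one in turn.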

Then you could just use the built-in Perl function index() to loop 
through all the occurrences of 'N', noting where each run starts and 
ends - type 'perldoc -f index' for help.
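
As a rough sketch of the index() approach: the helper name 
gaps_by_index() is made up, and I am assuming you want 1-based, 
inclusive coordinates and that gaps are always upper-case 'N'.

```perl
use strict;
use warnings;

# Find runs of 'N' with index(); returns [start, end] pairs,
# 1-based and inclusive. gaps_by_index() is a hypothetical helper name.
sub gaps_by_index {
    my ($dna) = @_;
    my @gaps;
    my $len = length $dna;

    # index() returns the 0-based offset of the next 'N', or -1 if none.
    my $pos = index($dna, 'N');
    while ($pos != -1) {
        my $end = $pos;

        # Extend $end to the last N of this run.
        $end++ while $end + 1 < $len && substr($dna, $end + 1, 1) eq 'N';

        push @gaps, [ $pos + 1, $end + 1 ];

        # Resume the search just past the end of this run.
        $pos = index($dna, 'N', $end + 1);
    }
    return @gaps;
}

my @gaps = gaps_by_index('ACGTNNNNACGNNA');
print "$_->[0]..$_->[1]\n" for @gaps;   # prints 5..8 then 12..13
```

If the scaffold uses lower-case 'n' as well, calling uc() on the string 
first would be one way to handle it.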

Alternatively, use regexp matching, e.g. m/(N+)/g, together with the 
pos() function.
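
Something like the following sketch would do it with a regexp. Again 
gaps_by_regex() is only an illustrative name, and the 1-based inclusive 
coordinates are my assumption about what you want reported.

```perl
use strict;
use warnings;

# Find runs of 'N' with a global regexp match and pos(); returns
# [start, end] pairs, 1-based and inclusive.
# gaps_by_regex() is a hypothetical helper name.
sub gaps_by_regex {
    my ($dna) = @_;
    my @gaps;
    while ($dna =~ m/(N+)/g) {
        # pos() is the 0-based offset just past the match, which equals
        # the 1-based position of the last N in the run.
        my $end   = pos($dna);
        my $start = $end - length($1) + 1;
        push @gaps, [ $start, $end ];
    }
    return @gaps;
}

my @gaps = gaps_by_regex('ACGTNNNNACGNNA');
print "$_->[0]..$_->[1]\n" for @gaps;   # prints 5..8 then 12..13
```

Adding the /i modifier to the match should also pick up lower-case 'n' 
gaps, if your scaffolds contain them.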

-- 
Dr Torsten Seemann               http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia