[Bioperl-l] Finding locations of a string within a fasta file
torsten.seemann at infotech.monash.edu.au
Mon Jul 17 02:21:31 UTC 2006
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> Suggestions on how to easily retrieve this would be appreciated.
First you need to get the sequence into a string within Perl. As your
subject line says it is in a FASTA file, you need to:
1. open the fasta file - see Bio::SeqIO
2. read first sequence (as an object) - see next_seq()
3. get the string of the sequence in the object - see seq()
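
The three steps above can be sketched roughly as follows. This is a minimal
sketch, assuming BioPerl is installed; the helper name first_sequence() is
only illustrative and is not part of BioPerl, and the file name is taken
from the command line.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: return the ID and sequence string of the first
# record in a FASTA file.
sub first_sequence {
    my ($path) = @_;

    # 1. open the FASTA file
    my $in = Bio::SeqIO->new(-file => $path, -format => 'fasta');

    # 2. read the first sequence as an object
    my $seq = $in->next_seq
        or die "No sequences found in $path\n";

    # 3. get the sequence as a plain Perl string
    return ($seq->display_id, $seq->seq);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my ($id, $string) = first_sequence($ARGV[0]);
    print "$id: ", length($string), " bp\n";
}
```

To process every scaffold in the file rather than just the first one, call
next_seq() in a while loop until it returns undef.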
Then you could just use the built-in Perl function index() to loop
through all the occurrences of 'N' - type 'perldoc -f index' for help.
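
One possible way to do the index() loop is sketched below. It assumes that
a gap is any run of one or more N's, that lowercase n should count too, and
that positions are reported 1-based and inclusive (the usual BioPerl
convention). The name gaps_by_index() is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [start, end] pairs
# (1-based, inclusive), one per run of N's in the sequence string.
sub gaps_by_index {
    my ($seq) = @_;
    $seq = uc $seq;                  # assumption: treat 'n' like 'N'
    my $len = length $seq;
    my @gaps;

    my $pos = index($seq, 'N');      # 0-based offset of first N, or -1
    while ($pos != -1) {
        # Extend $end to the last N of this run.
        my $end = $pos;
        $end++ while $end + 1 < $len && substr($seq, $end + 1, 1) eq 'N';

        push @gaps, [ $pos + 1, $end + 1 ];   # convert to 1-based

        # Resume searching just past this run.
        $pos = index($seq, 'N', $end + 1);
    }
    return @gaps;
}

# Made-up test string with two gaps.
print "$_->[0] .. $_->[1]\n" for gaps_by_index('ACGTNNNNACGTNNAC');
# prints:
# 5 .. 8
# 13 .. 14
```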
Alternatively, use regexp matching, e.g. m/(N+)/g, together with the pos() function.
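
A rough sketch of the regexp approach: after each successful m/(N+)/g
match, pos() gives the 0-based offset just past the match, which is the
same number as the 1-based end of the gap. The /i flag (to also catch
lowercase n) and the name gaps_by_regex() are assumptions made for this
example.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [start, end] pairs
# (1-based, inclusive), one per run of N's in the sequence string.
sub gaps_by_regex {
    my ($seq) = @_;
    my @gaps;

    # In scalar context, each //g match continues where the last one ended.
    while ($seq =~ m/(N+)/gi) {
        my $end   = pos($seq);                # offset past match = 1-based end
        my $start = $end - length($1) + 1;    # 1-based start of the run
        push @gaps, [ $start, $end ];
    }
    return @gaps;
}

# Made-up test string with two gaps.
print "$_->[0] .. $_->[1]\n" for gaps_by_regex('ACGTNNNNACGTNNAC');
# prints:
# 5 .. 8
# 13 .. 14
```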
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia