[Bioperl-l] K-mer generating script

James Estill jestill at plantbio.uga.edu
Fri Dec 19 19:07:11 EST 2008


SeqIO works great for this. I've used something like the following. This is part of a larger program, so some of it is not relevant to what you need ...
========================

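use Bio::SeqIO;

my $fasta_in = "your_file.fasta";   # input FASTA file (single record)
my $k = 3;                          # oligo (k-mer) length

# Set elsewhere in the larger program; placeholder values here
my $verbose    = 0;
my $temp_fasta = "oligos.fasta";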

my $in_seq_num = 0;
my $inseq = Bio::SeqIO->new( -file => "<$fasta_in",
                 -format => 'fasta');


   while (my $seq = $inseq->next_seq) {

    $in_seq_num++;
    if ($in_seq_num == 2) {
        print "\a";
        die "Input file should be a single sequence record\n";
    }

    # Calculate base coordinate data
    my $seq_len = $seq->length();
    # Largest 0-based window offset; the sequence holds $max_start + 1 oligos
    my $max_start = $seq_len - $k;
    
    # Print some summary data
    print STDERR "\n==============================\n" if $verbose;
    print STDERR "SEQ LEN: $seq_len\n" if $verbose;
    print STDERR "MAX START: $max_start\n" if $verbose;
    print STDERR "==============================\n" if $verbose;
    
    # CREATE FASTA FILE OF ALL K LENGTH OLIGOS
    # IN THE INPUT SEQUENCE
    print STDERR "Creating oligo fasta file\n" if $verbose;
    open (FASTAOUT, ">$temp_fasta") ||
        die "Cannot open temp fasta file $temp_fasta: $!\n";

    for my $i (0 .. $max_start) {

        # subseq() takes 1-based, inclusive coordinates
        my $start_pos = $i + 1;
        my $end_pos = $start_pos + $k - 1;

        my $oligo = $seq->subseq($start_pos, $end_pos);

        # Set counts array to zero
        $counts[$i] = 0;
        
        print FASTAOUT ">$start_pos\n";
        print FASTAOUT "$oligo\n";
        
    }

    close (FASTAOUT);

}
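
Note that the loop above writes out the k-mers that actually occur in one input sequence. If the goal is instead every possible k-mer over A, C, G and T, a short recursive function is probably enough. The following is only a minimal sketch and is not part of the program above; the name all_kmers and its optional alphabet argument are made up for illustration. It builds the k-mers by prefixing each letter to every (k-1)-mer, so the result has 4**k entries and grows quickly with k.

```perl
use strict;
use warnings;

# Return every string of length $k over the given alphabet.
# The alphabet defaults to the four DNA bases.
sub all_kmers {
    my ($k, @alphabet) = @_;
    @alphabet = qw(A C G T) unless @alphabet;

    # Base case: there is exactly one 0-mer, the empty string
    return ('') if $k <= 0;

    # Recursive step: take all (k-1)-mers ...
    my @shorter = all_kmers($k - 1, @alphabet);

    # ... and prefix each of them with each letter of the alphabet
    return map { my $base = $_; map { $base . $_ } @shorter } @alphabet;
}

# Print all 3-mers, one per line
print "$_\n" for all_kmers(3);
```

Call it in list context, e.g. my @kmers = all_kmers($k); for large k it may be better to generate and process the k-mers one at a time rather than hold them all in memory.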

-- Jamie Estill
-- jestill at uga.edu
-- http://jestill.myweb.uga.edu
-- http://www.epernicus.com/people/jestill
  _____  

From: Blanchette, Marco [mailto:MAB at stowers-institute.org]
To: bioperl-l at lists.open-bio.org [mailto:bioperl-l at lists.open-bio.org]
Sent: Fri, 19 Dec 2008 18:25:27 -0500
Subject: [Bioperl-l] K-mer generating script

Dear all,
  
  Does anyone have a little function that I could use to generate all possible k-mer DNA sequences? For instance, all possible 3-mers (AAA, AAT, AAC, AAG, etc.). I need something that takes the value of k as input and returns all possible sequences.
  
  I know that it's a problem that needs recursive programming, but I can't get my brain around it.
  
  Many thanks
  
  Marco
  --
  Marco Blanchette, Ph.D.
  Assistant Investigator
  Stowers Institute for Medical Research
  1000 East 50th St.
  
  Kansas City, MO 64110
  
  Tel: 816-926-4071
  Cell: 816-726-8419
  Fax: 816-926-2018
  
  _______________________________________________
  Bioperl-l mailing list
  Bioperl-l at lists.open-bio.org
  http://lists.open-bio.org/mailman/listinfo/bioperl-l
    

