[Bioperl-l] QUERY: Correct use of Bio::SeqIO::largefasta?

Thu Jul 23 01:07:45 EDT 2009

Hi,

I would like to use position weight matrices to scan chromosome sized sequences to identify variants of some of my favorite DNA elements.  I was able to install the Bioperl compliant TFBS_0.5 modules and use the documentation to create different objects (e.g., PFM, PWM, ICM) and all of the methods I tested seemed to work as expected.  However, when I tried to create sequence objects to search I seem to have hit a wall of sorts.  Below is my script; almost all of it is cut and pasted from the TFBS::Matrix::PFM POD, or the PODs I cite in the comments.  It works fine for a fasta file containing ~ 175,000 nt of worm DNA, but balks on fasta files of 1.7 Mb (and larger) [Output and Error messages included below].  Feeling rather new to Object oriented programming, I also ran the script through the Perl debugger (step-by-step) and convinced myself that Bio::SeqIO::largefasta was indeed opening filehandles in /tmp sub-directories.  My reading of the PODs for LargeSeq, LargePrimarySeq, and LargeSeqI makes it sound like I should be manually chopping the chromosomes up into smaller, digestible chunks (probably with a loop?), but thus far I haven't been able to find any good examples that people have commonly used (and examples I have found don't seem to "chunk").

Further details:
Bioperl version 1.0069
Perl version 5.10.0
Linux Ubuntu 8.10 with 512 Mb of RAM (because I am running it as a virtual machine using VM-ware running on a Windows XP host).

Thanks in advance for your feedback,

--Marc Perry
Ontario Institute for Cancer Research

Code:

#!/usr/bin/perl

use warnings;
use strict;
use TFBS::Matrix::PFM;
use Bio::Seq;
use Bio::SeqIO;

my $matrixref = [ [  5,  5,  5,  5,  5,  5, 85,  5,  5,  5,  5, 85],
                           [  5,  5,  5,  5,  5, 85,  5,  5,  5,  5, 85,  5],
                           [  5,  5, 85, 85,  5,  5,  5, 85,  5, 85,  5,  5],
                           [ 85, 85,  5,  5, 85,  5,  5,  5, 85,  5,  5,  5]
                            ];

my $pfm = TFBS::Matrix::PFM->new(-matrix => $matrixref,
                                                        -name   => "CeRep_matrix_1",
                                                        -ID     => "M1000"
                                                        );

my $pwm = $pfm->to_PWM();                               # convert to position weight matrix

my $stream = Bio::SeqIO->new(-format => 'largefasta',
                                                -fh     => \*ARGV);                  # from Bio::SeqIO POD

my $seq = $stream->next_seq();                          # from Bio::SeqIO::largefasta POD

my $siteset = $pwm->search_seq(-seqobj => $seq, -threshold => "75%");

print $siteset->GFF();

exit;

Output from 1.7 Megabase DNA fragment:

GET_SEQUENCE:  Sequence too long.
LOOP_ON_SEQS:  get_sequence failed.
MAIN:  loop_on_seqs failed.

Output from 175,000 bp DNA fragment:

CHROMOSOME_X TFBS TF binding site 33912 33923 12.950 - 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950"  ; sequence ttgggcagagca
CHROMOSOME_X TFBS TF binding site 33988 33999 16.494 - 0 TF CeRep_matrix_1 ; class Unknown ; score "16.494"  ; sequence ttggtcagtgta
CHROMOSOME_X TFBS TF binding site 74439 74450 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950"  ; sequence ttaatcagtgca
CHROMOSOME_X TFBS TF binding site 74470 74481 16.494 + 0 TF CeRep_matrix_1 ; class Unknown ; score "16.494"  ; sequence ttggtcagtgcg
CHROMOSOME_X TFBS TF binding site 74535 74546 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950"  ; sequence ttggacagtgaa
CHROMOSOME_X TFBS TF binding site 74567 74578 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950"  ; sequence atggtcagggca
CHROMOSOME_X TFBS TF binding site 103365 103376 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950"  ; sequence tcggttagtgca
CHROMOSOME_X TFBS TF binding site 175608 175619 12.950 - 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950"  ; sequence ttggtctgttca

Data (truncated to conserve space):

>CHROMOSOME_X
ctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct
aagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagc
ctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct
aagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
gcctaagcctaatctgtgctccaaagccttcgaactgacggacttgtgtc