[Bioperl-l] QUERY: Correct use of Bio::SeqIO::largefasta?
Marc Perry
Marc.Perry at oicr.on.ca
Thu Jul 23 01:07:45 EDT 2009
Hi,
I would like to use position weight matrices to scan chromosome sized sequences to identify variants of some of my favorite DNA elements. I was able to install the Bioperl compliant TFBS_0.5 modules and use the documentation to create different objects (e.g., PFM, PWM, ICM) and all of the methods I tested seemed to work as expected. However, when I tried to create sequence objects to search I seem to have hit a wall of sorts. Below is my script; almost all of it is cut and pasted from the TFBS::Matrix::PFM POD, or the PODs I cite in the comments. It works fine for a fasta file containing ~ 175,000 nt of worm DNA, but balks on fasta files of 1.7 Mb (and larger) [Output and Error messages included below]. Feeling rather new to Object oriented programming, I also ran the script through the Perl debugger (step-by-step) and convinced myself that Bio::SeqIO::largefasta was indeed opening filehandles in /tmp sub-directories. My reading of the PODs for LargeSeq, LargePrimarySeq, and LargeSeqI makes it sound like I should be manually chopping the chromosomes up into smaller, digestible chunks (probably with a loop?), but thus far I haven't been able to find any good examples that people have commonly used (and examples I have found don't seem to "chunk").
Further details:
Bioperl version 1.0069
Perl version 5.10.0
Linux Ubuntu 8.10 with 512 Mb of RAM (because I am running it as a virtual machine using VM-ware running on a Windows XP host).
Thanks in advance for your feedback,
--Marc Perry
Ontario Institute for Cancer Research
Code:
#!/usr/bin/perl
use warnings;
use strict;
use TFBS::Matrix::PFM;
use Bio::Seq;
use Bio::SeqIO;
my $matrixref = [ [ 5, 5, 5, 5, 5, 5, 85, 5, 5, 5, 5, 85],
[ 5, 5, 5, 5, 5, 85, 5, 5, 5, 5, 85, 5],
[ 5, 5, 85, 85, 5, 5, 5, 85, 5, 85, 5, 5],
[ 85, 85, 5, 5, 85, 5, 5, 5, 85, 5, 5, 5]
];
my $pfm = TFBS::Matrix::PFM->new(-matrix => $matrixref,
-name => "CeRep_matrix_1",
-ID => "M1000"
);
my $pwm = $pfm->to_PWM(); # convert to position weight matrix
my $stream = Bio::SeqIO->new(-format => 'largefasta',
-fh => \*ARGV); # from Bio::SeqIO POD
my $seq = $stream->next_seq(); # from Bio::SeqIO::largefasta POD
my $siteset = $pwm->search_seq(-seqobj => $seq, -threshold => "75%");
print $siteset->GFF();
exit;
Output from 1.7 Megabase DNA fragment:
GET_SEQUENCE: Sequence too long.
LOOP_ON_SEQS: get_sequence failed.
MAIN: loop_on_seqs failed.
Output from 175,000 bp DNA fragment:
CHROMOSOME_X TFBS TF binding site 33912 33923 12.950 - 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950" ; sequence ttgggcagagca
CHROMOSOME_X TFBS TF binding site 33988 33999 16.494 - 0 TF CeRep_matrix_1 ; class Unknown ; score "16.494" ; sequence ttggtcagtgta
CHROMOSOME_X TFBS TF binding site 74439 74450 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950" ; sequence ttaatcagtgca
CHROMOSOME_X TFBS TF binding site 74470 74481 16.494 + 0 TF CeRep_matrix_1 ; class Unknown ; score "16.494" ; sequence ttggtcagtgcg
CHROMOSOME_X TFBS TF binding site 74535 74546 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950" ; sequence ttggacagtgaa
CHROMOSOME_X TFBS TF binding site 74567 74578 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950" ; sequence atggtcagggca
CHROMOSOME_X TFBS TF binding site 103365 103376 12.950 + 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950" ; sequence tcggttagtgca
CHROMOSOME_X TFBS TF binding site 175608 175619 12.950 - 0 TF CeRep_matrix_1 ; class Unknown ; score "12.950" ; sequence ttggtctgttca
Data (truncated to conserve space):
>CHROMOSOME_X
ctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct
aagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagc
ctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct
aagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
gcctaagcctaatctgtgctccaaagccttcgaactgacggacttgtgtc
More information about the Bioperl-l
mailing list