[Bioperl-l] QUERY: Correct use of Bio::SeqIO::largefasta?
Dave Messina
David.Messina at sbc.su.se
Mon Jul 27 06:13:27 EDT 2009
Hi Mark,
I'm not familiar with the TFBS code, but these errors you're getting:
Output from 1.7 Megabase DNA fragment:
>
> GET_SEQUENCE: Sequence too long.
> LOOP_ON_SEQS: get_sequence failed.
> MAIN: loop_on_seqs failed.
are coming from the C code that's part of the TFBS package.
So I think your inclination to chop up the longer sequences is correct. I
modified your code to do that (see below).
Also, you might contact the author of the TFBS package (Boris.Lenhard at
bccs.uib.no) and see if he has any suggestions for handling longer sequences
with his code.
Dave
------- CUT HERE ------
#!/usr/bin/perl
use strict;
use warnings;
use TFBS::Matrix::PFM;
use Bio::Seq;
use Bio::SeqIO;
my $matrixref = [
[ 5, 5, 5, 5, 5, 5, 85, 5, 5, 5, 5, 85 ],
[ 5, 5, 5, 5, 5, 85, 5, 5, 5, 5, 85, 5 ],
[ 5, 5, 85, 85, 5, 5, 5, 85, 5, 85, 5, 5 ],
[ 85, 85, 5, 5, 85, 5, 5, 5, 85, 5, 5, 5 ]
];
my $chunklength = 175000; # set this to whatever you want your chunk size
to be
my $pfm = TFBS::Matrix::PFM->new(
-matrix => $matrixref,
-name => "CeRep_matrix_1",
-ID => "M1000"
);
my $pwm = $pfm->to_PWM(); # convert to position weight matrix
my $stream = Bio::SeqIO->new(
-format => 'fasta',
-fh => \*ARGV
);
while ( my $seq = $stream->next_seq() ) {
my $seqlength = $seq->length();
for ( my $i = 1 ; $i <= $seqlength ; $i += $chunklength ) {
my $start = $i;
my $end = $start + $chunklength - 1;
if ( $end > $seqlength ) { $end = $seqlength; }
my $subseq = $seq->subseq( $start, $end );
my $display_id = $seq->display_id . '/' . $start . '-' . $end;
my $chunk = Bio::Seq->new(
-seq => $subseq,
-display_id => $display_id,
-alphabet => $seq->alphabet,
);
my $siteset = $pwm->search_seq(
-seqobj => $chunk,
-threshold => " 75 %",
);
print $siteset->GFF();
}
}
More information about the Bioperl-l
mailing list