[Bioperl-l] How can I pull out all instances of a motif from a genome sequence and output them as a BED file?

Steve Chervitz sac at bioperl.org
Thu Jun 14 21:33:39 UTC 2007


This issue was discussed recently here. Check out this thread:

http://thread.gmane.org/gmane.comp.lang.perl.bio.general/15046/focus=15048

Some of the tools listed in the FAQ item Chris pointed to only report
that a match occurred, not where it occurred (String::Approx, agrep),
though some do report match locations (fuzznuc, fuzzprot; not sure
about TFBS).
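
For exact motifs, plain Perl regexes also report match locations. Below
is a minimal sketch, not a tested BioPerl recipe. The helper name
motif_to_bed, the chromosome label 'chr3L', and the toy sequence are
placeholders of mine. As far as I can tell, the list of 1's from the loop
in the original question is a precedence effect: "." and "+" have the
same precedence and associate left to right, so the string is
concatenated first and then numified by "+1". Wrapping it as
(pos($sequence_as_a_string) + 1) avoids that.

```perl
use strict;
use warnings;

# Hypothetical helper: return one BED line (chrom, start, end) for every
# non-overlapping match of $regex in $seq.
sub motif_to_bed {
    my ($chrom, $seq, $regex) = @_;
    my @bed;
    while ($seq =~ /$regex/g) {
        # $-[0] is the 0-based offset of the match start and $+[0] is the
        # offset just past its end, which matches BED's 0-based,
        # end-exclusive coordinates.
        push @bed, join("\t", $chrom, $-[0], $+[0]);
    }
    return @bed;
}

# Toy sequence standing in for $seq_object->seq(); 'chr3L' is an assumed label.
my @bed = motif_to_bed('chr3L', 'CCAAAGGAAAT', qr/AAA/);
print "$_\n" for @bed;
```

With /g the matches do not overlap, so a run such as AAAA is reported
once. If overlapping hits matter, a lookahead with a capture group is
one common way to get them, reading the offsets from $-[1] and $+[1]
instead.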

My Bio::Tools::SeqPattern module does not even perform any matches, it
just encapsulates a regular expression for a nuc or protein motif and
knows how to handle ambiguity code expansion and reverse
complementing. The idea is that you can use this to convert a
biological sequence motif into a string suitable for use in a perl
regex. Adding a match() method to this module would be handy.
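
To illustrate the idea without depending on the module, here is a
plain-Perl sketch of ambiguity code expansion and reverse complementing.
It is not the Bio::Tools::SeqPattern implementation; the helper names
iupac_to_regex and revcom_motif and the example motif GATAR are
hypothetical, and it only handles plain IUPAC letters, not motifs that
already contain regex syntax.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
my %IUPAC = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Hypothetical helper: expand a plain IUPAC motif into a regex string.
sub iupac_to_regex {
    my ($motif) = @_;
    my $regex = '';
    for my $code (split //, uc $motif) {
        die "Unknown IUPAC code '$code'\n" unless exists $IUPAC{$code};
        $regex .= $IUPAC{$code};
    }
    return $regex;
}

# Hypothetical helper: reverse complement a motif, ambiguity codes included.
sub revcom_motif {
    my ($motif) = @_;
    my $rc = scalar reverse uc $motif;
    $rc =~ tr/ACGTRYSWKMBDHVN/TGCAYRSWMKVHDBN/;
    return $rc;
}

my $motif = 'GATAR';                                  # example motif only
my $fwd   = iupac_to_regex($motif);                   # GATA[AG]
my $rev   = iupac_to_regex(revcom_motif($motif));     # [CT]TATC
print "forward: $fwd\nreverse: $rev\n";
```

Either string can then be used in a match as qr/$fwd/ or qr/$rev/ to
scan both strands of the same sequence.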

There is an example script for it in examples/tools of the distro
(which, btw, references an obsolete module, so it won't run as is; I'll
fix it).

Steve

On 6/13/07, Chris Fields <cjfields at uiuc.edu> wrote:
> This is answered in the FAQ (sorry if the URL wraps, but we don't
> like tinyurls):
>
> http://www.bioperl.org/wiki/FAQ#How_do_I_do_motif_searches_with_BioPerl.3F_Can_I_do_.22find_all_sequences_that_are_75.25_identical.22_to_a_given_motif.3F
>
> chris
>
> On Jun 13, 2007, at 7:20 PM, John Cumbers wrote:
>
> > Hello,
> >
> > I have a simple problem: I'm trying to search a genome sequence for a
> > motif, and I then want to output a BED file to display all the
> > locations of this motif on the UCSC Genome Browser.  I could not find
> > a script to do this, so I started to write my own.  I'm new to Perl,
> > and my code below was my attempt to read the sequence string and
> > output the index (in bp) of the start of each motif.  With this I
> > could build the BED file myself, which requires start and end
> > positions.
> >
> > For the first motif I can output the start index, but when I try to
> > read the next one off the sequence it does not work.  Instead I just
> > get a list of 1's as output.  I realise that this is more a request
> > for some simple Perl help, but any help is much appreciated.
> >
> > Best wishes,
> > John
> >
> >
> > # turn my FASTA file into a seq object
> > $seq_object = read_sequence("Drosophila.Chr3.test.AE014296.fasta");
> > $sequence_as_a_string = $seq_object->seq();  # turn it into a string
> > # search $sequence_as_a_string for motif AAA as example
> > # if found, return the index that it is found at
> >
> > while ($sequence_as_a_string =~ m/AAA/g) {
> >   print "Found '$&'.  Next attempt at character " . pos($sequence_as_a_string)+1 . "\n";
> > }
> >
> >
> >
> > --
> > John Cumbers,  Graduate Student
> > Biology and Medicine
> > Brown University, Box G-W
> > Providence, Rhode Island, 02912, USA
> > Tel USA: +1 401 523 8190,  Fax: +1 401 863-2166
> > UK to USA: 0207 617 7824
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>


