[Bioperl-l] Position scoring matrix objects

Fri Jul 25 14:55:39 EDT 2003

Hey Stefan - good to see your post.

This is definitely something we want in Bioperl.  There were old posts
about this some time last year, I think.

The closest thing to this has been an auxiliary package by Boris Lenhard
and his group, the TFBS toolkit (http://forkhead.cgb.ki.se/TFBS/), which
is quite useful for some of the things you describe.

I would definitely love to see some or all of the ideas contributed to
Bioperl - this would make a prime candidate for BioPAN if you wanted to
start with putting your code there.

Nat G et al were going to report back when they had started contributing
things to CPAN which would be considered part of this new attempt to get
more modules out there.  They might be able to suggest the best way
forward when that works.

-jason

On Fri, 25 Jul 2003, Stefan Kirov wrote:

> I am doing some research on cis-regulatory sites. I have looked through
> the bioperl mailing list and module documentation and it seems to me
> that there are no objects sufficiently suitable for this task (holding
> and working with position scoring matrices and their occurrences). So I
> wrote some motif-related Perl modules that might be (or not) of general
> interest, and I would like to hear what people on the mailing list think
> about this. I would also be happy to get any suggestions and criticism.
> By the way, what I have done so far works only for DNA.
>
> Here are the classes I have designed. There is no abstract interface at
> the moment. If people consider this important I can change it. I am
> documenting these modules and I am trying to follow the BioPerl structure.
>
> SiteMatrix.
>
> Synopsis: holds a position scoring matrix description and provides
> methods to extract different information from this object.
>
> Methods:
>
> new: construct from position scoring matrix hash, individual vectors can
> be supplied both as strings or arrays, takes the arguments as a hash
>
> iupac- return IUPAC compliant consensus as a string
>
> score- Returns the score as a real number
>
> IC- information content. Returns a real number
>
> id- identifier. Returns a string
>
> accession- accession number. Returns a string
>
> seq- return simple consensus (choose highest probability or N if prob
> too low), sequence
>
> next_pos- return the sequence probability for each letter, IUPAC symbol,
> IUPAC probability and simple sequence consensus letter for this position.
> Rewind at the end. Returns a hash.
>
> pos- current position get/set. Returns an integer.
>
> regexp- construct a regular expression based on IUPAC consensus. For
> example AGWV will be [Aa][Gg][AaTt][AaCcGg]
>
> width- self-explanatory. Returns an integer.
>
> get_string- gets the probability vector for a single base as a string.
> Throws an exception if the argument is not in {A,C,G,T}.
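
A rough sketch of how the regexp expansion described above could work, in
case it helps the discussion. The helper name iupac_to_regexp and the
%IUPAC table are made up for illustration; this is not the SiteMatrix code
itself, only the standard IUPAC nucleotide codes turned into character
classes.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG',
    N => 'ACGT',
);

# Hypothetical helper (not the SiteMatrix method): expand an IUPAC
# consensus into a pattern of case-insensitive character classes.
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $pattern = '';
    for my $code (split //, uc $consensus) {
        my $bases = $IUPAC{$code}
            or die "Not an IUPAC nucleotide code: $code\n";
        # Emit each base in upper and lower case, e.g. W -> [AaTt].
        $pattern .= '['
            . join('', map { uc($_) . lc($_) } split //, $bases)
            . ']';
    }
    return $pattern;
}

print iupac_to_regexp('AGWV'), "\n";   # prints [Aa][Gg][AaTt][AaCcGg]
```

Spelling out both cases in each class keeps the pattern usable without
the /i modifier, which matches the AGWV example in the description.
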
>
> When creating the object the constructor will check for positions that
> equal 0. If one is found it will increase the count for all positions
> by one and recalculate the frequencies. Potential bug- if you are using
> frequencies and one of the positions is 0, the matrix will change
> significantly. However, you should never have a frequency that equals 0.
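
The zero handling above might look roughly like this. It is only a sketch
under assumptions: the input is raw counts (not frequencies), "all
positions" is read as every cell of the matrix, and the helper name
counts_to_freqs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: takes count vectors keyed by base and returns
# per-column frequencies. If any cell is 0, every cell gets +1 first
# (one reading of "increase the count for all positions by one").
sub counts_to_freqs {
    my (%counts) = @_;
    my @bases = qw(A C G T);
    my $width = @{ $counts{A} };

    # An all-zero column carries no information, so reject it.
    for my $i (0 .. $width - 1) {
        my $sum = 0;
        $sum += $counts{$_}[$i] for @bases;
        die "Position vector at column $i is (0,0,0,0)\n" if $sum == 0;
    }

    # grep in scalar context returns the number of zero cells.
    my $has_zero = grep { $_ == 0 } map { @{ $counts{$_} } } @bases;
    my $pseudo   = $has_zero ? 1 : 0;

    my %freq;
    for my $i (0 .. $width - 1) {
        my $total = 0;
        $total += $counts{$_}[$i] + $pseudo for @bases;
        $freq{$_}[$i] = ($counts{$_}[$i] + $pseudo) / $total for @bases;
    }
    return %freq;
}

my %freq = counts_to_freqs(
    A => [8, 0], C => [1, 9], G => [1, 0], T => [0, 1],
);
for my $base (qw(A C G T)) {
    printf "%s: %s\n", $base,
        join(' ', map { sprintf '%.3f', $_ } @{ $freq{$base} });
}
```

With counts the +1 shifts each column only slightly; applying the same
step to values that are already frequencies is where the large change
mentioned above would come from.
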
>
> Throws an exception if:
>
> You mix arrays and strings as input (for example, the A vector is given
> as an array and C as a string).
>
> The position vector is (0,0,0,0).
>
> One of the probability vectors is shorter than the rest.
>
> The probabilities for A, C, G and T do not add up to 1 when you use
> strings as input vectors.
>
> Examples:
>
> A probability matrix as a string can be: "8913a09", where a is actually
> 10. This is merely done for compatibility with meme and transfac.
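
To make the string form concrete, here is one way such a vector could be
decoded. The helper name decode_vector is hypothetical, and dividing by 10
to get a probability is my assumption; the description only says that "a"
stands for 10.

```perl
use strict;
use warnings;

# Hypothetical decoder: one character per position, 0-9 plus "a" for 10.
# Scaling by 1/10 to obtain a probability is an assumption.
sub decode_vector {
    my ($string) = @_;
    die "Unexpected character in vector: $string\n"
        unless $string =~ /\A[0-9a]+\z/i;
    return map { (lc($_) eq 'a' ? 10 : $_) / 10 } split //, $string;
}

print join(' ', decode_vector('8913a09')), "\n";
```

Under that reading, the values at any one position across the A, C, G and
T strings should sum to 10, which is the "add up to 1" check described
above.
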
>
> my ($a,$c,$g,$t,$score,$ic,$mid,$seq)=@_; #Either arrayref or string
>
> my %param=(pA=>$a,pC=>$c,pG=>$g,pT=>$t,IC=>$ic,e_val=>$score,id=>$mid);
>
> my $site=SiteMatrix->new(%param);
>
> my $regexp=$site->regexp;
>
> # Count non-overlapping matches: a list assignment evaluated in scalar
> # context yields the number of matches.
> my $count = () = $seq =~ /$regexp/g;
>
> print "Motif $mid is present $count times in this sequence\n";
>
> Parsers that return SiteMatrix objects:
>
> Meme (the one distributed with bioperl does not work, and I was unable
> to get answers from the list and the developer)
>
> new(file)- associates the object with a meme file. Throws exception if
> the file is HTML format.
>
> parse_next- returns the next motif in the file as a SiteMatrix object
>
> Transfac
>
> The methods are pretty much the same, but SiteMatrix object might have
> empty fields- for example transfac entry will not contain score and
> information content:
>
> new
>
> At the moment the parsers are implemented as two separate classes. This
> probably should change and follow the same. There is also no rigorous
> check for format violations.
>
> InstanceSite holds information about an instance of a matrix A in the
> sequence B.
>
> Methods:
>
> new: creates object from a hash
>
> id- sequence id
>
> mid- motif id
>
> sequence
>
> relpos- position relative to the transcription start site, usually
> negative. Will be calculated if sequence length and position are supplied
>
> matrix- gets/sets the SiteMatrix associated with this instance
>
> diff- gets the number of mismatches based on regexp of SiteMatrix
> compared to the instance sequence
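
One possible reading of diff, sketched under assumptions: the regexp is a
plain run of bracketed character classes (as in the AGWV example), the
instance sequence has the same length, and the helper name
count_mismatches is made up here.

```perl
use strict;
use warnings;

# Hypothetical helper: count positions where $seq falls outside the
# corresponding character class of a pattern like [Aa][Gg][AaTt].
sub count_mismatches {
    my ($pattern, $seq) = @_;
    # Pull out the contents of each [...] class, one per position.
    my @classes = $pattern =~ /\[([^\]]+)\]/g;
    die "Length mismatch between pattern and sequence\n"
        unless @classes == length($seq);
    my $mismatches = 0;
    for my $i (0 .. $#classes) {
        my $base = substr($seq, $i, 1);
        $mismatches++ if index($classes[$i], $base) < 0;
    }
    return $mismatches;
}

print count_mismatches('[Aa][Gg][AaTt][AaCcGg]', 'AGCT'), "\n";   # prints 2
```
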
>
> Example:
>
> my %input=(score=>$score,start=>$pos,motif=>$id,
>            seqid=>$llid,seq=>$sequence);
>
> my $instance=new InstanceSite(%input);
>
> The Mast parser will return an array of InstanceSite objects. Very
> rudimentary.
>
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu