[Bioperl-l] Position scoring matrix objects

Stefan Kirov skirov at utk.edu
Fri Jul 25 14:27:06 EDT 2003


I am doing some research on cis-regulatory sites. I have looked through 
the bioperl mailing list and module documentation, and it seems to me 
that there are no objects sufficiently suitable for this task (holding 
and working with position scoring matrices and their occurrences). So I 
wrote some motif-related Perl modules that might (or might not) be of 
general interest, and I would like to hear what people on the mailing 
list think about them. I would also be happy to get any suggestions and 
criticism. By the way, what I have done so far works only for DNA.

Here are the classes I have designed. There is no abstract interface at 
the moment. If people consider this important I can change it. I am 
documenting these modules and I am trying to follow the BioPerl structure.

SiteMatrix.

Synopsis: holds a position scoring matrix description and provides 
methods to extract different information from this object.

Methods:

new: constructs the object from a position scoring matrix hash. The 
individual vectors can be supplied either as strings or as arrays. Takes 
the arguments as a hash

iupac- return IUPAC compliant consensus as a string

score- Returns the score as a real number

IC- information content. Returns a real number

id- identifier. Returns a string

accession- accession number. Returns a string

seq- returns the simple consensus (the highest-probability base at each 
position, or N if that probability is too low) as a sequence

next_pos- returns the probability of each letter, the IUPAC symbol, the 
IUPAC probability and the simple consensus letter for the current 
position. Rewinds at the end. Returns a hash.

pos- current position get/set. Returns an integer.

regexp- construct a regular expression based on IUPAC consensus. For 
example AGWV will be [Aa][Gg][AaTt][AaCcGg]

width- self-explanatory. Returns an integer.

get_string- gets the probability vector for a single base as a string. 
Throws an exception if the argument is not in {A,C,G,T}.
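
As a rough illustration of what the regexp method describes, the sketch 
below expands an IUPAC consensus into character classes. The helper name 
iupac_to_regexp and the lookup table are assumptions made for this 
example and are not taken from the SiteMatrix code itself.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',  B => 'CGT', D => 'AGT',
    H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper, not part of SiteMatrix: expand an IUPAC consensus
# into a pattern with one case-insensitive character class per position.
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $pattern = '';
    for my $symbol (split //, uc $consensus) {
        my $bases = $IUPAC{$symbol}
            or die "Not an IUPAC nucleotide symbol: $symbol\n";
        # Emit each base in upper and lower case, e.g. W -> [AaTt].
        $pattern .= '[' . join('', map { $_ . lc($_) } split //, $bases) . ']';
    }
    return $pattern;
}

print iupac_to_regexp('AGWV'), "\n";   # prints [Aa][Gg][AaTt][AaCcGg]
```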

When creating the object, the constructor checks for positions that 
equal 0. If one is found, it increases the count for all positions by 
one and recalculates the frequencies. Potential bug: if you supply 
frequencies rather than counts and one of them is 0, the values will 
change significantly. However, you should never have a frequency that 
equals 0.
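
The zero handling described above could look roughly like the sketch 
below. It assumes that "all positions" means every cell of the matrix 
and that the input holds raw counts; the function name 
counts_to_frequencies and the sample counts are invented for this 
example.

```perl
use strict;
use warnings;

# Hypothetical helper: takes per-base count vectors (array refs of equal
# length) and returns per-base frequency vectors. If any count is 0, every
# count in the matrix is incremented by one before frequencies are computed.
sub counts_to_frequencies {
    my (%counts) = @_;                       # keys A, C, G, T
    my @bases = qw(A C G T);
    my $width = @{ $counts{A} };

    # Number of zero cells anywhere in the matrix.
    my $has_zero = grep { $_ == 0 } map { @{ $counts{$_} } } @bases;
    my $pseudo   = $has_zero ? 1 : 0;

    my %freq;
    for my $i (0 .. $width - 1) {
        my $total = 0;
        $total += $counts{$_}[$i] + $pseudo for @bases;
        $freq{$_}[$i] = ($counts{$_}[$i] + $pseudo) / $total for @bases;
    }
    return %freq;
}

# Made-up counts for a motif of width 2.
my %freq = counts_to_frequencies(
    A => [8, 0], C => [1, 10], G => [1, 0], T => [0, 0],
);
for my $base (qw(A C G T)) {
    print "$base: ", join(' ', map { sprintf '%.3f', $_ } @{ $freq{$base} }), "\n";
}
```

Feeding frequencies into such a routine shows the problem mentioned 
above: adding 1 to values that lie between 0 and 1 swamps the original 
numbers.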

Throws an exception if:

You mix arrays and strings as input (for example, the A vector is given 
as an array and the C vector as a string).

The position vector is (0,0,0,0).

One of the probability vectors is shorter than the rest.

The probabilities for A,C,G and T do not add up to 1 when you use string 
as input vectors.

Examples:

A probability vector given as a string can be "8913a09", where a is 
actually 10. This is done merely for compatibility with meme and 
transfac.
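
The string encoding and the constructor checks listed earlier can be 
sketched together as follows. This is not the actual SiteMatrix 
constructor: decode_vector and check_matrix are hypothetical names, the 
C, G and T strings are made up to complement "8913a09", and the code 
assumes that string digits are tenths, so that a column "adds up to 1" 
when its digits total 10.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one vector into an array ref of numbers.
# In a string, each character is one position and 'a' stands for 10.
sub decode_vector {
    my ($vector) = @_;
    return [@$vector] if ref($vector) eq 'ARRAY';
    die "Unexpected character in vector '$vector'\n"
        unless $vector =~ /^[0-9aA]+$/;
    return [ map { lc($_) eq 'a' ? 10 : $_ } split //, $vector ];
}

# Hypothetical validator: returns the matrix width or dies with a reason.
sub check_matrix {
    my (%v) = @_;                            # keys pA, pC, pG, pT
    my @keys = qw(pA pC pG pT);

    # Arrays and strings must not be mixed.
    my $n_arrays = grep { ref($v{$_}) eq 'ARRAY' } @keys;
    die "Mixed array and string input\n"
        if $n_arrays != 0 && $n_arrays != @keys;
    my $from_strings = $n_arrays == 0;

    my %col;
    $col{$_} = decode_vector($v{$_}) for @keys;

    # All four vectors must have the same length.
    my $width = @{ $col{pA} };
    for my $key (@keys) {
        die "Vector $key has a different length\n"
            if @{ $col{$key} } != $width;
    }

    for my $i (0 .. $width - 1) {
        my $sum = 0;
        $sum += $col{$_}[$i] for @keys;
        die "Position $i is (0,0,0,0)\n" if $sum == 0;
        # Assumption: string digits are tenths, so a column must total 10.
        die "Position $i does not add up to 1\n"
            if $from_strings && $sum != 10;
    }
    return $width;
}

print check_matrix(
    pA => '8913a09', pC => '1042051', pG => '1150050', pT => '0005000',
), "\n";                                     # prints 7
```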

my ($a,$c,$g,$t,$score,$ic,$mid,$seq)=@_; # vectors: all arrayrefs or all strings

my %param=(pA=>$a,pC=>$c,pG=>$g,pT=>$t,IC=>$ic,e_val=>$score,id=>$mid);

my $site=SiteMatrix->new(%param);

my $regexp=$site->regexp;

# Count the non-overlapping matches of the pattern in the sequence
my $count=()=$seq=~/$regexp/g;

print "Motif $mid is present $count times in this sequence\n";
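
Counting motif hits with a regular expression has one subtlety: a plain 
global match skips occurrences that overlap an earlier hit. The 
standalone sketch below shows both ways of counting. The pattern is the 
AGWV expansion quoted earlier and the sequence is made up for the 
example.

```perl
use strict;
use warnings;

my $regexp = '[Aa][Gg][AaTt][AaCcGg]';       # pattern for AGWV
my $seq    = 'ttAGTAcccagacAGAGAG';          # made-up sequence

# A list-context match with /g returns every non-overlapping hit;
# assigning that list to () and then to a scalar yields the hit count.
my $count = () = $seq =~ /$regexp/g;

# A zero-width lookahead also finds hits that overlap each other.
my $overlapping = () = $seq =~ /(?=$regexp)/g;

print "non-overlapping: $count, overlapping: $overlapping\n";
# prints: non-overlapping: 3, overlapping: 4
```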

Parsers that return SiteMatrix objects:

Meme (the one distributed with bioperl does not work, and I was unable 
to get answers from the list or the developer)

new(file)- associates the object with a meme file. Throws an exception 
if the file is in HTML format.

parse_next- returns the next motif in the file as a SiteMatrix object

Transfac

The methods are pretty much the same, but the SiteMatrix object might 
have empty fields. For example, a transfac entry will not contain a 
score or information content:

new

At the moment the parsers are implemented as two separate classes. This 
should probably change so that both follow the same interface. There is 
also no rigorous check for format violations.

InstanceSite holds information about an instance of a matrix A in the 
sequence B.

Methods:

new: creates object from a hash

id- sequence id

mid- motif id

sequence

relpos- position relative to the transcription start site, usually 
negative. Will be calculated if the sequence length and position are 
supplied

matrix- gets/sets the SiteMatrix associated with this instance

diff- gets the number of mismatches based on regexp of SiteMatrix 
compared to the instance sequence
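
One way the diff method could count mismatches is sketched below. It 
assumes the regexp is a plain run of character classes, one per 
position, as in the [Aa][Gg][AaTt][AaCcGg] example; the function name 
count_mismatches is made up for this illustration and the real method 
may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper: count the positions where $sequence falls outside
# the corresponding character class of $regexp (e.g. "[Aa][Gg]").
sub count_mismatches {
    my ($regexp, $sequence) = @_;

    # Pull out the contents of each [...] class, one per motif position.
    my @classes = $regexp =~ /\[([^\]]+)\]/g;
    die "Sequence and pattern differ in length\n"
        if @classes != length($sequence);

    my $mismatches = 0;
    for my $i (0 .. $#classes) {
        my $base = substr($sequence, $i, 1);
        $mismatches++ if index($classes[$i], $base) < 0;
    }
    return $mismatches;
}

print count_mismatches('[Aa][Gg][AaTt][AaCcGg]', 'AGCT'), "\n";   # prints 2
```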

Example:

my %input=(score=>$score,start=>$pos,motif=>$id,seqid=>$llid,seq=>$sequence);

my $instance=InstanceSite->new(%input);

The Mast parser will return an array of InstanceSite objects. It is very 
rudimentary.


-- 
Stefan Kirov, Ph.D.
University of Tennessee/Oak Ridge National Laboratory
1060 Commerce Park, Oak Ridge
TN 37830-8026
USA
tel +865 576 5120
fax +865 241 1965
e-mail: skirov at utk.edu
sao at ornl.gov



