

I have a weight matrix of size 20 x 15 (amino acids x sequence positions). Each element of the matrix is a relative probability.

If I have a sequence, say "AAPGTGASMHSGLLW", how would I score it against the matrix? I tried taking the product of the probabilities the matrix assigns to each letter at its position, but I end up with a really small number.

Any ideas?


Consider the simple matrix:

    1     2     3     4
A   0.3   0.90  0.5   0.0001
B   0.2   0.05  0.4   0.2
C   0.5   0.05  0.1   0.8

The best match is CAAC, with a score of:

CAAC = 0.5 * 0.9 * 0.5 * 0.8 = 0.18

If you change the first letter to a B instead of a C, you get the match BAAC, with a score of:

BAAC = 0.2 * 0.9 * 0.5 * 0.8 = 0.072

That is a huge difference for such a small change. It is even worse with my larger matrix, since the score is easily affected by small probabilities.

{ asked by Omar }


The probabilities are correct. You must take the product (in log space, this is equivalent to summing the log-probabilities). The score only looks small because you are probably expecting it to be close to 1, which is not the case. To get a score of 1, the PWM would need probability 1 for one letter and 0 for all the others at every position, and the sequence would have to match perfectly.
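To make the product/log-sum equivalence concrete, here is a minimal Python sketch using the toy matrix from the question. The names `PWM`, `product_score` and `log_score` are made up for illustration, and the sketch assumes the matrix has no zero entries (a zero would make the logarithm undefined).

```python
import math

# Toy PWM from the question: one row per letter, one column per position.
PWM = {
    "A": [0.3, 0.90, 0.5, 0.0001],
    "B": [0.2, 0.05, 0.4, 0.2],
    "C": [0.5, 0.05, 0.1, 0.8],
}

def product_score(seq, pwm):
    """Multiply the probabilities the PWM assigns to each letter of seq."""
    score = 1.0
    for pos, letter in enumerate(seq):
        score *= pwm[letter][pos]
    return score

def log_score(seq, pwm):
    """Sum the log-probabilities instead; this equals log(product_score)."""
    # Assumes every entry is > 0, otherwise math.log raises ValueError.
    return sum(math.log(pwm[letter][pos]) for pos, letter in enumerate(seq))

for seq in ("CAAC", "BAAC"):
    print(seq, product_score(seq, PWM), math.exp(log_score(seq, PWM)))
```

For a 15-position amino acid matrix the product gets very small, so working with the summed logs directly (without exponentiating back) is usually more convenient.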

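So what should you compare it to? People usually compare it to a background distribution, the simplest being uniform. For a four-letter DNA alphabet that means 0.25 everywhere (for the 20 amino acids it would be 1/20 = 0.05, and your three-letter toy matrix would strictly call for 1/3). Using 0.25 for your four-position example, the background score is 0.25^4 = ~0.004, and this is what you should expect by chance.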

This is why people usually look at the ratio of the score under the PWM to the score under the background model (and usually take the log2 of that). In your case this is 0.18/0.0039 = ~46, so the sequence you got is 46 times more likely than you would expect by chance! For your second example, 0.072/0.0039 = ~18 times more likely than expected, so that is still high.
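The ratio can be computed per position and summed in log2 space. The sketch below assumes the uniform background of 0.25 per letter used in the numbers above; the function name `log_odds` and the `background` dictionary are illustrative choices, and for a real amino acid matrix you would swap in 1/20 or observed residue frequencies.

```python
import math

# Toy PWM from the question: one row per letter, one column per position.
PWM = {
    "A": [0.3, 0.90, 0.5, 0.0001],
    "B": [0.2, 0.05, 0.4, 0.2],
    "C": [0.5, 0.05, 0.1, 0.8],
}

def log_odds(seq, pwm, background):
    """log2 of P(seq | PWM) / P(seq | background), summed over positions."""
    # Assumes all PWM and background entries are > 0.
    return sum(
        math.log2(pwm[letter][pos] / background[letter])
        for pos, letter in enumerate(seq)
    )

# Assumption: uniform background of 0.25 per letter, as in the text above.
background = {letter: 0.25 for letter in PWM}

for seq in ("CAAC", "BAAC"):
    score = log_odds(seq, PWM, background)
    # 2 ** score recovers the plain ratio (roughly 46 and 18 here).
    print(seq, round(score, 2), round(2 ** score, 1))
```

A positive log-odds score means the PWM explains the sequence better than the background, and a negative one means the background explains it better.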

More conceptually, you are comparing two probabilistic models, your PWM and a background PWM, by the probability each one assigns to your observed sequence. This is a common approach for comparing probabilistic models in general, even when they are more complicated.
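Written out, this comparison is a log-likelihood ratio. The symbols are chosen here for illustration: $x = x_1 \dots x_L$ is the sequence, $p_i(x_i)$ is the PWM probability of letter $x_i$ at position $i$, and $q(x_i)$ is its background probability.

```latex
S(x) = \log_2 \frac{P(x \mid \text{PWM})}{P(x \mid \text{background})}
     = \sum_{i=1}^{L} \log_2 \frac{p_i(x_i)}{q(x_i)}
```

The second equality holds because both models treat positions as independent, so each probability is a product over positions.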

{ answered by Bitwise }