|
In order to use the learned models in identifying putative binding sites within
given sequences, we should know how to measure their statistical confidences.
To score a given K-mer we use the log likelihood ratio between the
binding model and the background model.
We then would like to assign the statistical significance (p-value) to a given score,
i.e. to evaluate the probability of getting such a score (or a better one) for
a K-mer generated by the background model.
Exact p-value calculation
Ideally, we would like to use an exact method to describe the score distribution
and cumulative distribution, (see for example
Bejerano, RECOMB 2003).
This option, however, is not efficiently practical in our case due to the high complexity
of our models.
The Naive Approximation
Empirical p-values can be computed using a naive sampling approach.
Using this scheme, a very large pool of K-mers is sampled from the
background distribution, then each K-mer is scored according to its
log-likelihood ratio.
For each possible threshold, the cumulative distribution function (a CDF)
is calculated by counting how many K-mers from the pool have passed it.
The main drawback of this method concerns the huge number of samples needed for
the exact estimation of significant p-values, due to the spareness of
high-scoring samples in the background distribution. This problem only increases
as we introduce a Bonferroni Correction to the p-values, taking
into account the number of trials we've done when search for putative significance
sites.
Gaussian Approximation
In the cases where our model contains no hidden variables (e.g. a PSSM or a
Tree network), it is possible to derive decomposable exact formulas for the
score's first two moments under the background model. For other models, these
statistics can be approximated easily and accurately.
This opens the way to the approximation of the score distribution using a
Gaussian (i.e. a Normal distribution).
Approximation using Importance Sampling
Once again, as in the Naive approach, we'll approximate the statistical significance
of a K-mer using an empirical distribution function, derived from a
sampled set of K-mers. However, to avoid vast amounts of samplings
we can use the Importance Sampling method.
This technique approximates the background score distribution, but with a higher
resolution on areas of interest (i.e. those with extremely high scores). It does
so by sampling from a different distribution Q( ) over the K-mers,
and weighting the samples according to their likelihood ratio between the
background distribution and Q( ).
Formally speaking, for a sampled K-mer:
In our case, we want Q( ) to be a mixture of models:

where Q0( ) is merely the background model,
QI( ) is the binding model, and
p(Qi) is the
mixing ratio of the model Qi( ).
Figure 1: Example of Q( ) : a mixture of 5 distributions Q0 ... Q4
 |
 |
 |
 |
 |
| Q0 (bg model) |
Q1 |
Q2 |
Q3 |
Q4 (binding model) |
| 10,000 samples |
5,623 samples |
3,162 samples |
1,778 samples |
1,000 samples |
Comparison of score CDFs using 3 methods
The following two figures show cumulative distribution functions for the binding
site model used for our synthetic results (See paper, section 6a). Here,
1,000,000 K-mers were sampled from a Human 3-order markov chain
background model. These were assigned log-odds scores, and were used to build a
cumulative distribution function (Naive approximating - red
line). Another 1,000,000 K-mers were sampled from a
Normal distribution whose mean and variance were calculated using the
binding and background models (Normal distribution
approximating - green line). Using the Importance Sampling
method (blue line) 40,882 K-mers
were sampled from ten different models Q0 ... Q9
(9986, 7732, 5987, 4636, 3590, 2780, 2153, 1667, 1291, 1000 samples).

Although all three CDFs look very similar, zooming in to the interesting area
(of high scoring K-mers), show significant difference between the
Normal approximation, and the other two CDFs.

Note that the the Importance Sampling approximation works well,
resulting in a very accurate threshold for sites with
significant p-values (smaller than 0.01 after Bonferroni
Correction. With a typical promoter length of 500bp, checking on both
strands, this equals to a p-value of 0.00001).
The Normal distribution approximation, however, tends to predict a more
significant p-value than the real one (up to a 20-fold fluctuation),
resulting in many false-positive sites.
To summarize, this figures shows that using a small and efficient sample set,
the Importance Sampling method is capable of accurately approximating the
score CDF.
|