| Modeling Dependencies in Protein-DNA Binding Sites: Synthetic Experiments |
| We built several datasets, each consisting of both ``positive'' promoters (i.e., sequences in which we planted binding site motifs) and ``negative'' ones. To simulate the underlying biological problem as accurately as possible, the motifs themselves were sampled from models trained on known binding sites of the human LUN transcription factor from the TRANSFAC database (V$LUN1_01). In each setting, we created two parallel sets: one sampled from a tree network that contains position dependencies, and the other from a PSSM model. The promoter sequences were sampled from a third-order Markov model background distribution trained on human promoter regions. To simulate noise, we contaminated our datasets with another group of ``false positive'' promoters, in which no motif was planted. We set the observation model such that all ``positive'' sequences had P(R=true|O) = 0.99, while the ``negative'' ones had P(R=true|O) = 0.01. |
| Generating model | True Positives | False Positives | Length (bp) | | |
| Tree | 100 | 0 | 500 | ROC | dataset |
| Tree | 50 | 50 | 500 | ROC | dataset |
| Tree | 25 | 75 | 500 | ROC | dataset |
| PSSM | 100 | 0 | 500 | ROC | dataset |
| PSSM | 50 | 50 | 500 | ROC | dataset |
| PSSM | 25 | 75 | 500 | ROC | dataset |
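The generation procedure for one PSSM setting from the table above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to build the datasets: the function names, the toy PSSM values, the uniform fallback for the background model, and the `n_negative` parameter are all hypothetical. It also assumes that ``false positive'' promoters receive the same P(R=true|O) = 0.99 as true positives, which is one reading of the description. The tree-network variant is not shown.

```python
import random

ALPHABET = "ACGT"

# Hypothetical toy PSSM (one row of A/C/G/T weights per motif position).
# These are NOT the LUN parameters; they only make the sketch runnable.
TOY_PSSM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

def sample_background(length, order, cond_probs, rng):
    """Sample from an order-k Markov chain (order >= 1).

    cond_probs maps a k-letter context to weights over ALPHABET.
    Unseen contexts fall back to uniform, standing in for a trained model.
    """
    seq = [rng.choice(ALPHABET) for _ in range(order)]
    while len(seq) < length:
        context = "".join(seq[-order:])
        weights = cond_probs.get(context, [0.25] * 4)
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq[:length])

def sample_pssm_site(pssm, rng):
    """Draw each motif position independently, as a PSSM assumes."""
    return "".join(rng.choices(ALPHABET, weights=col)[0] for col in pssm)

def plant(seq, site, rng):
    """Overwrite a random window of seq with the sampled site."""
    start = rng.randrange(len(seq) - len(site) + 1)
    return seq[:start] + site + seq[start + len(site):]

def make_dataset(n_true, n_false, n_negative, length, pssm, cond_probs,
                 order=3, seed=0):
    """Return (name, motif_planted, p_regulated, sequence) tuples."""
    rng = random.Random(seed)
    records = []
    for i in range(n_true + n_false + n_negative):
        seq = sample_background(length, order, cond_probs, rng)
        planted = i < n_true
        if planted:
            seq = plant(seq, sample_pssm_site(pssm, rng), rng)
        # Assumption: false positives are labeled like true positives.
        p_regulated = 0.99 if i < n_true + n_false else 0.01
        records.append((f"seq{i}", planted, p_regulated, seq))
    return records

# Example: 50 true positives, 50 false positives, 500 bp each;
# the count of 100 negatives is an arbitrary choice for illustration.
dataset = make_dataset(50, 50, 100, 500, TOY_PSSM, cond_probs={})
```

Passing an empty `cond_probs` makes the background uniform; a trained third-order model would supply one weight vector per three-letter context.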
Dataset file format (tab-separated fields): Name [planted motif] <tab> Regulation probability <tab> Sequence
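A file in the three-field layout above could be read with a short parser such as the one below. This is a sketch under assumptions: it presumes one record per line, the `Record` type and function names are hypothetical, and the first field is kept verbatim because the exact layout of the name and the planted-motif annotation is not specified here. The sample line is made up for illustration.

```python
from typing import NamedTuple

class Record(NamedTuple):
    name_field: str      # name plus any planted-motif annotation, unparsed
    p_regulated: float   # the regulation probability column
    sequence: str        # promoter sequence

def parse_line(line):
    """Split one tab-separated record into its three fields."""
    name_field, prob, seq = line.rstrip("\n").split("\t")
    return Record(name_field.strip(), float(prob), seq.strip())

def parse_dataset(lines):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_line(line) for line in lines if line.strip()]

# Hypothetical example record, not taken from the actual datasets.
records = parse_dataset(["seq1 ACGT\t0.99\tTTACGTGA\n"])
```

Because `parse_dataset` accepts any iterable of lines, an open file handle can be passed to it directly.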