Modeling Dependencies in Protein-DNA Binding Sites:
Synthetic Experiments
We built several datasets, each consisting of both "positive" promoters (i.e., sequences in which we planted binding site motifs) and "negative" ones. To simulate the underlying biological problem as accurately as possible, the motifs themselves were sampled from models trained on known binding sites of the human LUN transcription factor from the TRANSFAC database (V$LUN1_01). In each setting, we created two parallel sets, one sampled from a tree network that contains position dependencies and the other from a PSSM model. The promoter sequences were sampled from a third-order Markov model background distribution trained on human promoter regions. To simulate noise, we contaminated our datasets with another group of "false positive" promoters, in which no motif was planted. We set the observation model such that all "positive" sequences had P(R=true|O) = 0.99, while the "negative" ones had P(R=true|O) = 0.01.
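
The sampling procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build the datasets: the function names, the `transitions` table, the uniform start of the Markov chain, and the toy PSSM are all hypothetical, and only the PSSM case is shown (sampling from the tree network would additionally require drawing each position conditioned on its parent). The sketch also assumes that "false positive" promoters receive the same regulation probability as true positives.

```python
import random

ALPHABET = "ACGT"

def sample_background(length, transitions, rng):
    """Sample a sequence from a third-order Markov model.

    `transitions` (hypothetical) maps each 3-mer context to a list of four
    probabilities over A, C, G, T; unseen contexts fall back to uniform.
    """
    # Assumption: the first three bases are drawn uniformly.
    seq = [rng.choice(ALPHABET) for _ in range(3)]
    while len(seq) < length:
        context = "".join(seq[-3:])
        weights = transitions.get(context, [0.25] * 4)
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq[:length])

def sample_motif(pssm, rng):
    """Draw one binding site from a PSSM, each position independently."""
    return "".join(rng.choices(ALPHABET, weights=col)[0] for col in pssm)

def plant_motif(background, motif, rng):
    """Overwrite a random window of the background with the motif."""
    start = rng.randrange(len(background) - len(motif) + 1)
    planted = background[:start] + motif + background[start + len(motif):]
    return planted, start

def make_promoter(length, transitions, pssm, plant, labeled_positive, rng):
    """Return (sequence, motif start or None, P(R=true|O))."""
    seq = sample_background(length, transitions, rng)
    start = None
    if plant:
        seq, start = plant_motif(seq, sample_motif(pssm, rng), rng)
    # 0.99 / 0.01 are the observation-model values given in the text.
    prob = 0.99 if labeled_positive else 0.01
    return seq, start, prob
```

In this sketch a true positive corresponds to `plant=True, labeled_positive=True`, a false positive to `plant=False, labeled_positive=True`, and a negative to `plant=False, labeled_positive=False`.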

Generating model    True Positives    False Positives    Length (bp)
Tree                100               0                  500            ROC    dataset
Tree                50                50                 500            ROC    dataset
Tree                25                75                 500            ROC    dataset
PSSM                100               0                  500            ROC    dataset
PSSM                50                50                 500            ROC    dataset
PSSM                25                75                 500            ROC    dataset

Test data

Each line in the datasets contains:

Name [planted motif] <tab> Regulation probability <tab> Sequence
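
A minimal sketch of reading this format is given below. It assumes exactly three tab-separated fields per line and leaves the first field untouched, since the exact syntax of the planted-motif annotation inside the name field is not specified here. The names `Record`, `parse_line`, and `parse_lines` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple

class Record(NamedTuple):
    name: str               # raw first field, including any planted-motif annotation
    regulation_prob: float  # P(R=true|O) for this sequence
    sequence: str

def parse_line(line: str) -> Record:
    """Parse one 'name <tab> probability <tab> sequence' line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    name, prob, sequence = fields
    return Record(name.strip(), float(prob), sequence.strip())

def parse_lines(lines: Iterable[str]) -> Iterator[Record]:
    """Parse an iterable of lines, skipping blank ones."""
    for line in lines:
        if line.strip():
            yield parse_line(line)
```

For example, `list(parse_lines(open(path)))` would load a whole dataset file, where `path` stands for a locally saved copy of one of the datasets.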