Introduction

Transcriptional regulation of genes is achieved by the concerted interactions of multiple transcription factors with arrays of regulatory sequences on DNA and with each other. Consequently, traditional approaches to understanding transcriptional regulation have focussed on the identification and combinatorial analysis of these cis-regulatory sequences.

With this tool, we focus on context-dependent interactions between transcription factor binding sites (TFBSs) that may explain why the expression of genes changes in different directions under a particular condition. For that purpose, we build upon the distance difference matrix (DDM) concept from structural biology, where DDMs are used to compare protein structures and detect significant similarities and differences between related structures. We introduce the concept of 'distance' between TFBSs as a measure of their degree of association, and build one distance matrix per promoter set, each summarizing all TFBS associations for one of the two sets of promoters of differentially regulated genes. Finally, we calculate the DDM and perform multidimensional scaling (MDS) on the resulting matrix. TFBSs that do not contribute to the observed differential gene expression are mapped in the bulk, which distinguishes them from 'deviating' TFBSs that are likely candidates to be responsible for the observed differential gene expression.
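
As a rough illustration of the DDM-MDS idea described above, the following Python sketch subtracts two TFBS distance matrices and maps the result to two dimensions. It is not the tool's implementation. The function names are hypothetical, and the way the DDM is fed into MDS (Euclidean distances between DDM rows, followed by classical MDS) is an assumption made for this example; the text above does not specify the 'distance' measure or the MDS variant.

```python
import numpy as np


def distance_difference_matrix(dist_a, dist_b):
    """Element-wise difference of two TFBS distance matrices (same TFBS order)."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    if dist_a.shape != dist_b.shape:
        raise ValueError("both distance matrices must have the same shape")
    return dist_a - dist_b


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS of a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


def ddm_mds(dist_a, dist_b, n_components=2):
    """Return the DDM and a low-dimensional map of the TFBSs."""
    ddm = distance_difference_matrix(dist_a, dist_b)
    # ASSUMPTION: each DDM row is treated as the profile of one TFBS, and
    # Euclidean distances between rows are passed to classical MDS.
    diff = ddm[:, None, :] - ddm[None, :, :]
    row_dist = np.sqrt((diff ** 2).sum(axis=2))
    return ddm, classical_mds(row_dist, n_components)


# Toy data: four TFBSs; only the last one changes its associations.
dist_b = np.full((4, 4), 0.5)
np.fill_diagonal(dist_b, 0.0)
dist_a = dist_b.copy()
dist_a[3, :3] = 0.1
dist_a[:3, 3] = 0.1

ddm, coords = ddm_mds(dist_a, dist_b)
radii = np.linalg.norm(coords, axis=1)
print(radii.round(3))
```

In this toy example the three unchanged TFBSs collapse onto one point near the origin, while the fourth lies farther out, which mirrors the 'bulk' versus 'deviating' distinction made above.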


Available matrix sets

Choosing your set of matrices:

  • TRANSFAC 11.3 is the commercial dataset from TRANSFAC.
  • JASPAR is the non-redundant dataset of annotated, high-quality transcription factor binding sites (see the JASPAR website).
  • phyloFACTS is a dataset of matrices derived from statistically overrepresented, evolutionarily conserved regulatory region motifs from mammalian genomes (see Xie et al. for more information).


Example data sets

We provide a number of examples, described in our paper:




Output

Obtaining useful results takes about one hour; the exact time depends on the given promoter sets, the chosen PWM library, and server load.
Links to the output of the method will be sent to your e-mail address.
The output includes:

  • The 'input' files: the (validated) FASTA files corresponding to the given promoter sets or the given gene names (RefSeq or HUGO IDs).
  • A parameter file: a file describing the parameters used to run the TFdiff job, namely the PWM library, match thresholds, names of the two given promoter sets, job ID and e-mail address.
  • TF lists: two files with the .csv extension containing the TFs corresponding to the two groups, labeled by PWM name and accompanied by a trend value (sum of the DDM values), p-value and q-value.
    An empirical significance calculation with 1000 random DDM-MDS runs is used to obtain these lists. Ideally, more random runs should be performed; a standalone version can be downloaded for this purpose. If you would like the analysis to be run with the latest TRANSFAC version, write us an e-mail.
  • Figure: the available figure is called "DDM_MDS_probabilistic.ps".
    The regular DDM-MDS plot (as described in the article) does not fully account for the different information content of each of the Positional Weight Matrices (PWMs). Hence the distance to the origin on a PWM-independent scale is not necessarily a good indicator of the over- or underrepresentation of predicted sites of a PWM.
    The probabilistic DDM-MDS plot is derived from the regular plot by placing each PWM at a distance from the origin proportional to its -log(p-value), while keeping the original angle. This corresponds fully to the TF lists described above. No plot is made if there are no significant results.
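
The empirical significance calculation and the radial rescaling behind the probabilistic plot can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: the function names are hypothetical, the test statistic is taken to be any per-PWM score where larger means more extreme, the p-value uses the common (count + 1) / (runs + 1) estimator, and a base-10 logarithm is assumed since the text only states -log(pvalue).

```python
import numpy as np


def empirical_p_values(observed, random_runs):
    """Empirical p-value per PWM from randomized runs.

    observed:    shape (n_pwm,), observed statistic per PWM.
    random_runs: shape (n_runs, n_pwm), same statistic from random runs.
    ASSUMPTION: larger values are more extreme; the +1 correction keeps
    p-values strictly positive.
    """
    observed = np.asarray(observed, dtype=float)
    random_runs = np.asarray(random_runs, dtype=float)
    exceed = (random_runs >= observed).sum(axis=0)
    return (exceed + 1.0) / (random_runs.shape[0] + 1.0)


def probabilistic_coordinates(coords, p_values):
    """Move each 2-D point to radius -log10(p) while keeping its angle."""
    coords = np.asarray(coords, dtype=float)
    radius = -np.log10(np.asarray(p_values, dtype=float))
    angle = np.arctan2(coords[:, 1], coords[:, 0])
    return np.column_stack((radius * np.cos(angle), radius * np.sin(angle)))


# Toy example: two PWMs, three random runs.
p_toy = empirical_p_values([5.0, 1.0], [[1.0, 2.0], [2.0, 0.5], [6.0, 3.0]])
print(p_toy)  # [0.5  0.75]

# Rescale two points with given p-values of 0.01 and 0.1.
rescaled = probabilistic_coordinates([[3.0, 4.0], [0.0, -2.0]], [0.01, 0.1])
print(rescaled.round(3))
```

With 1000 random runs, the smallest p-value this estimator can return is 1/1001, which is one reason why running more randomizations gives finer resolution.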