Introduction

A sequence motif is a short sequence conserved in multiple amino acid or nucleic acid sequences with biological significance such as DNA binding sites, where transcription factors (TFs) bind and catalytic sites in enzymes. Motifs are often represented as position count matrix (PCM)(Table 1), position frequency matrix (PFM)(Table 2) or position weight matrix (PWM) (reviewed in13). Sequence logos, based on information theory or probability theory, have been used as a graphical representation of sequence motifs to depict the frequencies of bases or amino acids at each conserved position48.

Table 1: Sample of PCM. Each row represents the counts of A, C, G and T, while each column is the counts for that nucleotide position.
A 1 1 23 22 1 2
C 19 16 0 0 2 4
G 2 4 0 1 19 15
T 1 2 0 0 1 2
Table 2: Sample of PFM. Each row represents the frequency of A, C, G and T, while each column is the frequency for that nucleotide position.
A 0.04 0.04 1 0.96 0.04 0.09
C 0.83 0.70 0 0.00 0.09 0.17
G 0.09 0.17 0 0.04 0.83 0.65
T 0.04 0.09 0 0.00 0.04 0.09

High-throughput experimental approaches have been developed to identify binding sequences for a large number of TFs in several organisms. The experimental methods include protein binding microarray (PBM)9, bacterial one-hybrid (B1H)10, systematic evolution of ligands by exponential enrichment (SELEX)11,12, High-Throughput (HT) SELEX13, and selective microfluidics-based ligand enrichment followed by sequencing (SMiLE-seq)14. Several algorithms have been developed to discover motifs from the binding sequences, identified by the above experimental methods. For example, MEME Suite15 has been widely used to construct motifs for B1H and SELEX data, while Seed-and-Wobble (SW)16, Dream517, and BEEML18 have been used to generate motifs for PBM data.

Visualization tools have been developed for representing a sequence motif as individual sequence logo, such as weblogo7and seqlogo19. To facilitate the comparison of multiple motifs, several motif alignment and clustering tools have been developed such as Tomtom20, MatAlign21, and STAMP22. However, these tools are lacking flexibility in dispalying the similarities or differences among motifs.

The motifStack package23 is designed for for the graphic representation of multiple motifs with their similarities or distances depicted. It provides the flexibility for users to customize the clustering and graphic parameters such as distance cutoff, the font family, colors, symbols, layout styles, and ways to highlight different regions or branches.

In this workshop, we will demonstrate the features and flexibility of motifStack for visualizing the alignment of multiple motifs in various styles. In addition, we will illustrate the utility of motifStack for providing insights into families of related motifs using several large collections of homeodomain (HD) DNA binding motifs from human, mouse and fly. Furthermore, we will show how motifStack facilitated the discovery that changing the computational method used to build motifs from the same binding data can lead to artificial segregation of motifs for identical TFs using the same motif alignment method.

Installation

BiocManager is used to install the released version of motifStack.

library(BiocManager)
install("motifStack")

To install the development version of motifStack, please try

install("jianhong/motifStack")
packageVersion("motifStack")
## [1] '1.29.5'

Please note that starting from version 1.28.0, motifStack requires cario (>=1.6) instead of ghostscript which is for a lower version of motifStack. The capabilities function could be used to check capablilities of cario for current R session. If motifStack version number is smaller than 1.28.0, the full path to the executable ghostscript must be available for R session.

In this tutorial, we will illustrate the functionalities of motifStack available in version >=1.29.4.

capabilities()["cairo"]
## cairo 
##  TRUE

Inputs of motifStack

The inputs of motifStack are one or more motifs represented as PCM or PFM. Motifs in PFM format are available in most of the motif databases such as TRANSFAC24, JASPAR25, CIS-BP26 and HOCOMOCO27. More databases could be found at http://meme-suite.org/db/motifs.

In addition, MotifDb package contains a collection of DNA-binding sequence motifs in PFM format. 9933 PFMs were collected from 11 sources: cisbp_1.0226, FlyFactorSurvey28, HOCOMOCO27, HOMER29, hPDI30, JASPAR25, Jolma201331, ScerTF32, stamlab33, SwissRegulon34 and UniPROBE9.

library(MotifDb)
MotifDb[1]
## MotifDb object of length 1
## | Created from downloaded public sources: 2013-Aug-30
## | 1 position frequency matrices from 1 source:
## |         cisbp_1.02:    1
## | 1 organism/s
## |        Scerevisiae:    1
## Scerevisiae-cisbp_1.02-M0001_1.02
MotifDb[[1]]
##             1           2           3           4           5           6
## A 0.009615385 0.009615385 0.971153846 0.009615385 0.009615385 0.971153846
## C 0.009615385 0.009615385 0.009615385 0.009615385 0.971153846 0.009615385
## G 0.009615385 0.009615385 0.009615385 0.009615385 0.009615385 0.009615385
## T 0.971153846 0.971153846 0.009615385 0.971153846 0.009615385 0.009615385
##             7           8
## A 0.009615385 0.009615385
## C 0.971153846 0.009615385
## G 0.009615385 0.009615385
## T 0.009615385 0.971153846

The plotMotifLogo function can be used to plot sequence logo for motifs in PFM format (Figure 1). As default, the nucleotides in a sequence logo will be drawn proportional to their information content (IC)8.

library(motifStack)
plotMotifLogo(MotifDb[[1]])
Plot sequence logo for motif in PFM format.

Figure 1: Plot sequence logo for motif in PFM format.

A PFM can be formated as an pfm object which is the preferred format for motifStack (Figure 2).

matrix.bin <- query(MotifDb, "bin_SANGER")
motif <- new("pfm", mat=matrix.bin[[1]], name=names(matrix.bin)[1])
motifStack(motif)
Plot sequence logo for a `pfm` object.

Figure 2: Plot sequence logo for a pfm object.

Similary, a PCM can be converted to a pcm object easily (Figure 3).

path <- system.file("extdata", package = "motifStack")
pcm <- read.table(file.path(path, "bin_SOLEXA.pcm"))
pcm
##   V1 V2  V3   V4   V5   V6   V7  V8   V9
## 1  A  | 462    0 1068 1025 1068   0 1019
## 2  C  |  71   60    0   24    0 993    0
## 3  G  | 504    0    0    0    0  12    0
## 4  T  |  31 1008    0   19    0  63   49
pcm <- pcm[,3:ncol(pcm)]
rownames(pcm) <- c("A","C","G","T") # must have rownames
motif <- new("pcm", mat=as.matrix(pcm), name="bin_SOLEXA")
plot(motif)
Plot sequence logo for a `pcm` object.

Figure 3: Plot sequence logo for a pcm object.

In addition, importMatrix function can be used to import the motifs into a pcm or pfm object from files exported by TRANSFAC24, JASPAR25 or CIS-BP26.

motifs <- importMatrix(dir(path, "*.pcm", full.names = TRUE))
motifStack(motifs)
Plot sequence logo for data imported by `importMatrix`

Figure 4: Plot sequence logo for data imported by importMatrix

Plot motifs with ggplot2

Sequence logo can also be plotted by geom_motif function which is compatible with ggplot236 (Figure 10).

len <- length(motifs)
motifs[[1]]$markers <- list(markerRect2, markerRect)
df <- data.frame(x=.5, y=(seq.int(len)-.5)/len, 
                 width=.75, height=1/(len+1))
df$motif <- motifs
library(ggplot2)
ggplot(df, aes(x=x, y=y, width=width, height=height, motif=motif)) +
  geom_motif(use.xy = TRUE) + theme_bw() + xlim(0, 1) + ylim(0, 1)
Plot sequence logo with ggplot2.

Figure 10: Plot sequence logo with ggplot2.

Plot multiple sequence logos

motifStack is designed to visulize the alignment of multiple motifs and facilitate the analysis of binding site diversity/conservation within families of TFs and evolution of binding sites among different species. There are many functions in motifStack such as plotMotifLogoStack, motifCloud, plotMotifStackWithRadialPhylog, motifPiles, and motifCircos for plotting motifs in different layouts, motifSignature for generating motif signatures by merging similar motifs. To make it easy to use, we created one workflow function called motifStack, which can be used to draw a single DNA, RNA or amino acid motif as a sequence logo, as well as to align motifs, generate motif signatures and plot aligned motifs as a stack, phylogeny tree, radial tree or cloud of sequence logos.

To align motifs, DNAmotifAlignment function implements Ungapped Smith–Waterman (local) alignment based on column comparison scores calculated by Kullback-Leibler distance37. DNAmotifAlignment requires the inputs be sorted by the distance of motifs. To get the sorted motifs for alignment, distances can be calculated using motifDistance in motIV (R implementation of STAMP22). Distances generated from MatAlign21 are also supported with a helper function ade4::newick2phylog38.

Plot sequence logo stack

As shown in above, motifStack function can be used to plot motifs as a stack of sequence logos easily. By setting the layout to “tree” (Figure 11) or “phylog” (Figure 12), motifStack will plot the stack of sequence logos with a linear tree.

motifStack(motifs, layout = "tree")
Plot sequence logo stack with tree layout.

Figure 11: Plot sequence logo stack with tree layout.

motifStack(motifs, layout="phylog", f.phylog=.15, f.logo=0.25)
Plot sequence logo stack with phylog layout.

Figure 12: Plot sequence logo stack with phylog layout.

Even though the layout of “phylog” allows more motifs to be plotted comparing to the layout of “tree”, both layouts only allow small sets of motifs to be readily compared. However, moderate throughput analysis of TFs has generated 100s to 1000s of motifs, which make the visualization of their relationships more challenging. To reduce the total number of motifs depicted, clusters of highly similar motifs can be merged with the motifSignature function using a distance threshold. The distance threshold is user-determined; a low threshold will merge only extremely similar motifs while a higher threshold can group motifs with lower similarity.

matrix.fly <- query(MotifDb, "FlyFactorSurvey")
motifs2 <- as.list(matrix.fly)
## format the name
names(motifs2) <- gsub("(_[\\.0-9]+)*_FBgn\\d+$", "", 
                       elementMetadata(matrix.fly)$providerName)
names(motifs2) <- gsub("[^a-zA-Z0-9]", "_", names(motifs2))

motifs2 <- motifs2[unique(names(motifs2))]
set.seed(1)
pfms <- sample(motifs2, 50)

## use MotIV to calculate the distances of motifs
jaspar.scores <- MotIV::readDBScores(file.path(find.package("MotIV"), 
                                               "extdata", 
                                               "jaspar2010_PCC_SWU.scores"))
d <- MotIV::motifDistances(lapply(pfms, pfm2pwm))
hc <- MotIV::motifHclust(d, method="average")
## convert the hclust to phylog object
phylog <- hclust2phylog(hc)
## reorder the pfms by the order of hclust
leaves <- names(phylog$leaves)
pfms <- pfms[leaves]
## create a list of pfm objects
pfms <- mapply(pfms, names(pfms), 
               FUN=function(.pfm, .name){
                 new("pfm",mat=.pfm, name=.name)})
## extract the motif signatures
motifSig <- motifSignature(pfms, phylog, groupDistance=0.01, min.freq=1)

Logos for these merged motifs can be plotted on the same dendrograms, using the relative size of the logos to reflect the number of TFs that contribute to the merged motif. Compared with the original dendrogram, the number of TFs with similar motifs and the relationships between the merged motifs can be more easily compared. The 50 motifs shown in the original image are now shown as only 28 motif clusters/signatures, with the members of each cluster clearly indicated. By setting the colors of dendrogram via parameter col.tree, the colors of lable of leaves via parameter col.leaves, and the colors of annotation bar via parameter col.anno, motifPiles function can mark each motif with multiple annotation colors (Figure 13).

## get the signatures from object of motifSignature
sig <- signatures(motifSig)
## get the group color for each signature
gpCol <- sigColor(motifSig)

library(RColorBrewer)
color <- brewer.pal(12, "Set3")
## plot the logo stack with pile style.
motifPiles(phylog=phylog, pfms=pfms, pfms2=sig, 
            col.tree=rep(color, each=5),
            col.leaves=rep(rev(color), each=5),
            col.pfms2=gpCol, 
            r.anno=c(0.02, 0.03, 0.04), 
            col.anno=list(sample(colors(), 50), 
                          sample(colors(), 50), 
                          sample(colors(), 50)),
            motifScale="logarithmic",
            plotIndex=TRUE,
            groupDistance=0.01)
Plot sequence logo stack by motifPiles.

Figure 13: Plot sequence logo stack by motifPiles.

Plot sequence logo cloud

To emphasize the weight of motif signatures, motif signatures can be plotted as a circular (Figure 14) or rectanglular word clouds with motif size representing the number of motifs that contributed to the signature. The larger the motif size, the larger the number of motifs within that motif signature.

## draw the motifs with a word-cloud style.
motifCloud(motifSig, scale=c(6, .5), layout="cloud")