For my Master's thesis, I'm working on modeling some time-dependent sequences. There is a pretty rich literature on doing this, much of it addressing the unique challenges posed in voice recognition.

In order to understand the details of this post, it would be good to familiarize yourself with the following concepts, which will be touched on throughout the post:

**Restricted Boltzmann Machines**: An energy-based model, meaning that it assigns an energy to every configuration of its units and is trained to "reduce the energy" of configurations that look like the input dataset. Conceptually, an RBM is "run" by alternating back and forth between visible and hidden neurons; after alternating for a while, the model will settle to a low-energy configuration (ideally our dataset). For a mathematical explanation, these resources are available:

**Recurrent Neural Networks**: These are becoming very popular for modeling a whole host of things. I've previously written about language modeling with RNNs; other applications include computer vision (helping the computer figure out which parts of an image are important) and speech recognition (where temporal dependencies are very important).

Some references which explore this topic in greater detail can be found here:

**Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription**: This paper from the University of Montreal in 2012 presented a Restricted Boltzmann Machine model which was used to generate music. They tied together timesteps of a Restricted Boltzmann Machine with recurrent units, allowing it to generatively model time-varying sequences. They wrote a very nice tutorial on how to build their model here.

**The Unreasonable Effectiveness of Recurrent Neural Networks**: This was a really popular blog post about using neural networks to generate text based on a reference text. There are many cool examples of applying this to other things, including generating Irish folk music and Obama-style speeches.

**Speech Recognition with Deep Recurrent Neural Networks**: A pretty influential paper from Geoff Hinton and company in 2013 which explains LSTMs in greater detail.

**Multimodal Learning with Deep Boltzmann Machines**: This paper describes using a probabilistic model to couple data from different modalities (in particular, vision and text) and generate one given the other. For example, given some keywords, their model is able to generate relevant images, and given an image, it is able to generate descriptive keywords.

Most of this post will rely on using Theano. The general concepts can probably be ported over to another framework pretty easily (if you do this, I would be interested in hearing about it). It probably also helps to have a GPU, if you want to do more than try toy examples. You can follow the installation instructions here, although getting a GPU working with your system can be a bit painful.

The question which RBMs are often used to answer is, "What do we do when we don't have enough labeled data?" Approaching this question from a neural network perspective would probably lead you to the autoencoder, where instead of training a model to produce some output given an input, you train a model to reproduce the input. Autoencoders are easy to think about, because they build on the knowledge that most people have about conventional neural networks. However, in practice, RBMs tend to outperform autoencoders for important tasks.

*Boulanger-Lewandowski, Bengio, and Vincent (2012)* suggest that, unlike a regular discriminative neural network, an RBM is good at modeling multi-modal data. This is evident when comparing the features learned by the RBM on the MNIST task with those learned by the autoencoder; even though the autoencoder did learn some spatially localized features, there aren't very many multi-modal features. In contrast, the majority of the features learned by the RBM are multi-modal; they actually look like penstrokes, and preserve a lot of the correlated structure in the dataset.

By definition, the connection weights of an RBM define a probability distribution

`P(v) = \frac{1}{Z} \sum_{h}{e^{-E(v,h)}}`

Given a piece of data $x$, parameters $\theta$ are updated to increase the probability of the training data and decrease the probability of samples $\tilde{x}$ generated by the model

`-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathscr{F}(x)}{\partial \theta} - \sum_{\tilde{x}}{p(\tilde{x})\frac{\partial \mathscr{F}(\tilde{x})}{\partial \theta}}`

where $\mathscr{F}$ indicates the free energy of a visible vector, or the negative log of the sum of $e^{-E}$ over the joint configurations of that visible vector with all possible hidden vectors

`\mathscr{F}(x) = -\log \sum_{h}{e^{-E(x,h)}}`
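To make the free energy concrete, here is a minimal sketch for a tiny binary RBM. It assumes the standard binary RBM energy $E(v,h) = -b \cdot v - c \cdot h - \sum_{ij} h_i W_{ij} v_j$, which this post does not spell out, and the names `W`, `b`, `c` and all of the numbers are made up for illustration. It compares the sum over hidden vectors against the usual closed form for binary hidden units, and then normalizes over every visible vector to get $P(v)$, which is only feasible because the model is tiny.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """Assumed binary RBM energy: E(v,h) = -b.v - c.h - sum_ij h_i W_ij v_j."""
    vis = sum(bj * vj for bj, vj in zip(b, v))
    hid = sum(ci * hi for ci, hi in zip(c, h))
    inter = sum(h[i] * W[i][j] * v[j]
                for i in range(len(h)) for j in range(len(v)))
    return -vis - hid - inter

def free_energy_brute(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v,h)), enumerating every hidden vector."""
    total = sum(math.exp(-energy(v, h, W, b, c))
                for h in itertools.product([0, 1], repeat=len(c)))
    return -math.log(total)

def free_energy(v, W, b, c):
    """Closed form for binary hidden units:
    F(v) = -b.v - sum_i log(1 + exp(c_i + W_i.v))."""
    vis = sum(bj * vj for bj, vj in zip(b, v))
    pre = [ci + sum(w * vj for w, vj in zip(Wi, v)) for Wi, ci in zip(W, c)]
    return -vis - sum(math.log1p(math.exp(x)) for x in pre)

def prob_visible(v, W, b, c):
    """P(v) = exp(-F(v)) / Z, with Z summed over every visible vector."""
    Z = sum(math.exp(-free_energy(u, W, b, c))
            for u in itertools.product([0, 1], repeat=len(b)))
    return math.exp(-free_energy(v, W, b, c)) / Z

# Hypothetical toy model: 3 visible units, 2 hidden units.
# W[i][j] connects hidden unit i to visible unit j.
W = [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]]
b = [0.1, -0.2, 0.0]   # visible biases
c = [0.3, -0.5]        # hidden biases

for v in itertools.product([0, 1], repeat=3):
    print(v, round(free_energy(v, W, b, c), 4), round(prob_visible(v, W, b, c), 4))
```

For a real model the partition function $Z$ cannot be enumerated like this, which is why training relies on samples from the model instead.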

The explicit derivatives used to update the visible-hidden connections are

`-\frac{\partial \log p(v)}{\partial W_{ij}} = \mathbb{E}_v[p(h_i | v) \cdot v_j] - v_j^{(i)} \cdot \mathrm{sigm}(W_i \cdot v^{(i)} + c_i)`

In words, the connection between a visible unit $v_j$ and a hidden unit $h_i$ is changed so that the expected activation of that hidden unit goes down in general, but goes up when the data vector is presented, if the visible unit is on in that data vector.
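The two terms of that gradient can be computed exactly for a toy model, which is a useful sanity check before trusting a sampled approximation. The sketch below assumes binary units and the closed-form free energy $\mathscr{F}(v) = -b \cdot v - \sum_i \log(1 + e^{c_i + W_i \cdot v})$; the parameter values and the helper names are hypothetical. The model expectation is computed by enumerating every visible vector, which is exactly the part that has to be replaced by Gibbs samples in a model of realistic size.

```python
import itertools
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """p(h_i = 1 | v) = sigm(W_i . v + c_i) for each hidden unit i."""
    return [sigm(ci + sum(w * vj for w, vj in zip(Wi, v))) for Wi, ci in zip(W, c)]

def free_energy(v, W, b, c):
    """Assumed closed form for binary hidden units."""
    vis = sum(bj * vj for bj, vj in zip(b, v))
    pre = [ci + sum(w * vj for w, vj in zip(Wi, v)) for Wi, ci in zip(W, c)]
    return -vis - sum(math.log1p(math.exp(x)) for x in pre)

def neg_log_prob(v, W, b, c):
    """-log p(v) = F(v) + log Z, with Z enumerated over all visible vectors."""
    all_v = itertools.product([0, 1], repeat=len(b))
    log_Z = math.log(sum(math.exp(-free_energy(u, W, b, c)) for u in all_v))
    return free_energy(v, W, b, c) + log_Z

def exact_gradient(v_data, W, b, c):
    """-d log p(v_data) / dW_ij = E_v[p(h_i|v) v_j] - v_data_j * p(h_i|v_data)."""
    n_h, n_v = len(c), len(b)
    all_v = list(itertools.product([0, 1], repeat=n_v))
    weights = [math.exp(-free_energy(u, W, b, c)) for u in all_v]
    Z = sum(weights)
    # Model term: expectation under the model's own distribution over v.
    model = [[0.0] * n_v for _ in range(n_h)]
    for u, w in zip(all_v, weights):
        ph = hidden_probs(u, W, c)
        for i in range(n_h):
            for j in range(n_v):
                model[i][j] += (w / Z) * ph[i] * u[j]
    # Data term: hidden activations driven by the training vector.
    ph_data = hidden_probs(v_data, W, c)
    return [[model[i][j] - v_data[j] * ph_data[i] for j in range(n_v)]
            for i in range(n_h)]

# Hypothetical toy parameters; W[i][j] connects hidden i to visible j.
W = [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]]
b = [0.1, -0.2, 0.0]
c = [0.3, -0.5]
v_data = [1, 0, 1]

grad = exact_gradient(v_data, W, b, c)
learning_rate = 0.1
# Gradient descent on -log p(v_data): the data term pushes W_ij up,
# the model term pushes it down.
W_new = [[W[i][j] - learning_rate * grad[i][j] for j in range(3)] for i in range(2)]
print(neg_log_prob(v_data, W, b, c), neg_log_prob(v_data, W_new, b, c))
```

Note that the gradient entries for `v_data[j] == 0` contain only the model term, which matches the "if the visible unit is on in that data vector" condition above.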

Samples are "generated by the model" by repeatedly jumping back and forth from visible to hidden units. It is not evident why this gives a probability distribution. Suppose you choose a random hidden vector; given the connections between layers, that vector maps to a visible vector. The probability distribution of visible vectors is therefore generated from the hidden distribution. We would like to mold the model so that our random hidden vector will be more likely to map to a visible vector in our dataset. If that doesn't work, we would like to tweak the model so that a random visible vector will map to a hidden vector that maps to our dataset. And so on. After training, running the model on a random probability distribution twists it around to give us a probability distribution of visible vectors that is close to our dataset.
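The "jumping back and forth" is block Gibbs sampling. A minimal sketch, assuming binary units with $p(h_i = 1 | v) = \mathrm{sigm}(c_i + W_i \cdot v)$ and $p(v_j = 1 | h) = \mathrm{sigm}(b_j + \sum_i h_i W_{ij})$; the function names and toy parameters are hypothetical, and the random number generator is passed in so that a run can be reproduced.

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, c, rng):
    """Sample every hidden unit given the visible vector."""
    probs = [sigm(ci + sum(w * vj for w, vj in zip(Wi, v))) for Wi, ci in zip(W, c)]
    return [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, W, b, rng):
    """Sample every visible unit given the hidden vector."""
    probs = [sigm(b[j] + sum(h[i] * W[i][j] for i in range(len(h))))
             for j in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_chain(v0, W, b, c, n_steps, rng):
    """Alternate visible -> hidden -> visible for n_steps and return the last v."""
    v = list(v0)
    for _ in range(n_steps):
        h = sample_hidden(v, W, c, rng)
        v = sample_visible(h, W, b, rng)
    return v

# Hypothetical toy parameters; W[i][j] connects hidden i to visible j.
W = [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]]
b = [0.1, -0.2, 0.0]
c = [0.3, -0.5]

rng = random.Random(0)
samples = [gibbs_chain([0, 0, 0], W, b, c, 20, rng) for _ in range(5)]
print(samples)
```

Running many chains for long enough and counting how often each visible vector shows up gives an empirical estimate of the distribution that the weights define.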

The best way to think about what an RBM is doing during learning is that it is increasing the probability of a good datapoint, then running for a bit to get a bad datapoint, and decreasing its probability. It is changing the probabilities by updating the connections so that the bad datapoint is more likely to map to the good datapoint than the other way around. So when you have a cluster of good datapoints, their probabilities will be increased together (since they are close to each other, they are unlikely to be selected as the "bad" point of another sample in the cluster), and the probability of all the points around that cluster will be decreased. This illustrates the importance of increasing the number of steps of Gibbs sampling as training goes on, in order to get out of the cluster. This also gives some intuition on why RBMs learn multi-modal representations that autoencoders can't; RBMs find clusters of correlated points, while autoencoders only learn representations which minimize the amount of information required to represent some vectors.

As described above, there is some reason to think that an RBM model may learn higher-order correlations better than a traditional neural network. However, as they are conventionally described, RBMs can't model time-varying statistics very well. For many applications this presents a serious drawback. The top answer on Quora for the question **Are Deep Belief Networks useful for Time Series Forecasting?** is by Yoshua Bengio, who suggests looking at the work of his Ph.D. student, Nicolas Boulanger-Lewandowski, who wrote the tutorial that much of this blog post is modeled around. In particular, the paper **Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription** and its corresponding tutorial provide a good demonstration of doing almost exactly what Andrej Karpathy's blog post does, although instead of using RNNs to continually predict the next element of a sequence, it does something a little different.

The RNN-RBM uses an RNN to generate a visible and hidden bias vector for an RBM, and then trains the RBM normally (to reduce the energy of the model when initialized with those bias vectors and the visible vector at the first time step). Then the next visible vector is fed into the RNN and RBM, the RNN generates another set of bias vectors, and the RBM reduces the energy of that new configuration. This is repeated for the whole sequence.
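As a rough sketch of the bias-generation step, the code below follows the recurrences from the Boulanger-Lewandowski et al. paper as I understand them: $b_v^{(t)} = b_v + W_{uv} u^{(t-1)}$, $b_h^{(t)} = b_h + W_{uh} u^{(t-1)}$, and $u^{(t)} = \tanh(b_u + W_{uu} u^{(t-1)} + W_{vu} v^{(t)})$, where $u$ is the RNN hidden state. The dictionary keys, the helper names and the toy numbers are my own assumptions, and the RBM training step that would consume each pair of biases is left out.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def add(*vecs):
    """Elementwise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vecs)]

def rnn_rbm_biases(sequence, params, u0):
    """Return one (visible_bias, hidden_bias) pair per time step."""
    u = list(u0)
    biases = []
    for v in sequence:
        # The biases at time t depend only on the RNN state from time t-1.
        bv_t = add(params["bv"], matvec(params["Wuv"], u))
        bh_t = add(params["bh"], matvec(params["Wuh"], u))
        biases.append((bv_t, bh_t))
        # The RNN state is then updated with the visible vector at time t.
        u = [math.tanh(x) for x in add(params["bu"],
                                       matvec(params["Wuu"], u),
                                       matvec(params["Wvu"], v))]
    return biases

# Hypothetical toy sizes: 3 visible units, 2 RBM hidden units, 2 recurrent units.
params = {
    "bv": [0.0, 0.1, -0.1],                          # static visible bias
    "bh": [0.2, -0.2],                               # static hidden bias
    "bu": [0.0, 0.0],                                # RNN bias
    "Wuv": [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]],     # recurrent -> visible bias
    "Wuh": [[1.0, 0.0], [0.0, 1.0]],                 # recurrent -> hidden bias
    "Wuu": [[0.1, 0.0], [0.0, 0.1]],                 # recurrent -> recurrent
    "Wvu": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],       # visible -> recurrent
}
sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
biases = rnn_rbm_biases(sequence, params, u0=[0.0, 0.0])
for t, (bv_t, bh_t) in enumerate(biases):
    print(t, bv_t, bh_t)
```

With a zero initial state, the first pair of biases is just the static `bv` and `bh`; every later pair is shifted by what the RNN has seen so far.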

What exactly does this training process do? Let's consider the application that is described in both the paper and tutorial, generating polyphonic music (polyphonic here just means there may be multiple notes at the same time step). The weight matrix of the RBM, which has dimensions `<n_visible, n_hidden>`, provides features that activate individual hidden units in response to a particular pattern of visible units. For music, these features are chords, which are played at each timestep to generate a song; for video, these features are individual frames, which are very similar to the features learned on the MNIST dataset.

The RNN part is trained to generate biases that activate the right features of the RBM in the right order; in other words, the RNN tries to predict the next set of features given a past set. When we switch the RBM from learning a probability distribution to generating one, the RNN is used to generate biases for the RBM, defining a pattern of activating filters. The stochasticity of the RBM is what gives the model its nondeterminism.
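Generation can be sketched as a loop that, at each time step, asks the RNN for biases, runs a few steps of Gibbs sampling in the RBM with those biases, and feeds the sampled visible vector back into the RNN. This is only an illustration under assumptions: the recurrences are the ones from the paper as I understand them, the parameters are untrained toy values, the names are hypothetical, and the Gibbs chain is restarted from an all-zero vector at every step as a simple choice. Here `W` is stored as `[hidden][visible]`, the transpose of the `<n_visible, n_hidden>` layout mentioned above.

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def add(*vecs):
    return [sum(vals) for vals in zip(*vecs)]

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs(v, W, bv, bh, k, rng):
    """Run k visible -> hidden -> visible steps with the given biases."""
    for _ in range(k):
        h = bernoulli([sigm(x) for x in add(bh, matvec(W, v))], rng)
        # W is [hidden][visible], so the visible input is W transposed times h.
        back = [sum(W[i][j] * h[i] for i in range(len(h))) for j in range(len(v))]
        v = bernoulli([sigm(x) for x in add(bv, back)], rng)
    return v

def generate(params, n_steps, k, rng):
    """Sample a sequence of visible vectors from an RNN-RBM."""
    n_visible = len(params["bv"])
    u = [0.0] * len(params["bu"])
    sequence = []
    for _ in range(n_steps):
        # The RNN state picks which RBM features are favored at this step.
        bv_t = add(params["bv"], matvec(params["Wuv"], u))
        bh_t = add(params["bh"], matvec(params["Wuh"], u))
        # The RBM supplies the randomness.
        v = gibbs([0] * n_visible, params["W"], bv_t, bh_t, k, rng)
        sequence.append(v)
        u = [math.tanh(x) for x in add(params["bu"],
                                       matvec(params["Wuu"], u),
                                       matvec(params["Wvu"], v))]
    return sequence

# Hypothetical untrained parameters: 3 visible, 2 hidden, 2 recurrent units.
params = {
    "W": [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]],
    "bv": [0.0, 0.1, -0.1],
    "bh": [0.2, -0.2],
    "bu": [0.0, 0.0],
    "Wuv": [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]],
    "Wuh": [[1.0, 0.0], [0.0, 1.0]],
    "Wuu": [[0.1, 0.0], [0.0, 0.1]],
    "Wvu": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
}
song = generate(params, n_steps=8, k=5, rng=random.Random(0))
for step in song:
    print(step)
```

Two runs with different seeds give different sequences from the same parameters, which is the nondeterminism described above; fixing the seed makes a run repeatable.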