This tutorial is a bare-bones introduction to Theano, in the style of Andrew Trask’s NumPy example. For a more complete version, see the official tutorial. It is mostly to help me learn to use Theano, and feedback is more than welcome.

I used Python 3.5 and Theano 0.8. If you already have Theano set up, skip this. Otherwise, see the installation instructions here; usually this means doing a `pip install Theano`.

Here is just the code. The network has 5 hidden neurons and learns the XOR function, which takes two inputs and returns a high output only if exactly one of the inputs is high. Otherwise, it returns a low output.

Ok, let’s see what’s going on here.

Here, we’re creating shared variables X and y, representing our inputs and outputs, respectively. Shared variables are like global variables in a programming language; they are shared between functions, such as the functions “train” and “test” later on. We also initialize a random number generator “rng” and define a learning rate.
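To make the data concrete, here is a rough sketch of the same setup in plain NumPy. It is not the Theano code from the post: in Theano each array would additionally be wrapped with `theano.shared`. The row order, the seed `1234`, and the `LEARNING_RATE` value are my own assumptions for illustration.

```python
import numpy as np

# The four possible XOR inputs, one pair per row, and the target for each row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Seeded generator so that every run starts from the same random weights.
rng = np.random.RandomState(1234)  # seed is an arbitrary assumption

LEARNING_RATE = 0.1  # assumed value, not taken from the post
```

Keeping y as a column (shape 4×1) means it lines up with the network output, which is also one value per input row.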

Here, we define a function which creates and returns a matrix of random numbers between -1.0 and 1.0, whose size we specify. The matrix is also a shared variable. We use this function to create the weights W1 and W2 for our network.
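A minimal NumPy sketch of such a weight-creating function might look like the following. The name `layer`, the seed, and the hidden-layer width `N_HIDDEN` are assumptions made for this example, and the Theano version would also wrap the result in a shared variable.

```python
import numpy as np

rng = np.random.RandomState(1234)  # arbitrary seed, assumed for repeatability

def layer(n_in, n_out):
    """Return an (n_in, n_out) matrix of uniform random values in [-1.0, 1.0)."""
    return rng.uniform(low=-1.0, high=1.0, size=(n_in, n_out))

N_HIDDEN = 5  # assumed hidden-layer width

W1 = layer(2, N_HIDDEN)  # input (2 values) -> hidden layer
W2 = layer(N_HIDDEN, 1)  # hidden layer -> single output
```

The shapes are chosen so that a 4×2 input matrix times W1 gives a 4×N_HIDDEN matrix, and that times W2 gives a 4×1 output.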

We finally get into constructing the network. Theano usefully includes the “sigmoid” function, which is used as the network’s activation function. We multiply the input vector X by the first weight matrix and apply the activation function; we then take this output and multiply it by the second weight matrix before again applying the activation function.
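Written out in NumPy, the forward pass is just two matrix products with a sigmoid after each. This is a sketch rather than the post's code: in Theano the same expression would be built symbolically from `theano.tensor.dot` and `theano.tensor.nnet.sigmoid`. The helper names `sigmoid` and `forward`, the seed, and the hidden width of 5 are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """Two-layer forward pass: sigmoid(sigmoid(X W1) W2)."""
    hidden = sigmoid(X @ W1)     # shape: (n_examples, n_hidden)
    return sigmoid(hidden @ W2)  # shape: (n_examples, 1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
rng = np.random.RandomState(1234)           # arbitrary seed
W1 = rng.uniform(-1.0, 1.0, size=(2, 5))    # 5 hidden units, assumed
W2 = rng.uniform(-1.0, 1.0, size=(5, 1))

output = forward(X, W1, W2)
print(output.shape)  # (4, 1)
```

Because the sigmoid squashes everything into the range 0 to 1, each output can be read as how strongly the network believes the answer is 1.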

For the neural network, we would like to minimize the squared error of the network, which is shown in our “cost” function. The squared error is the square of the difference between the output of the network and the desired output. Since this is a binary classification task, we have one output, 0 or 1. If exactly one of the inputs is high, we would like the network to output a 1; otherwise, it should output a 0.
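As a small worked example of the squared error, here it is in NumPy with made-up network outputs. Summing over the four examples (rather than averaging) is an assumption on my part, and `squared_error` is a name invented for this sketch.

```python
import numpy as np

def squared_error(y, output):
    """Sum of squared differences between targets and network outputs."""
    return np.sum((y - output) ** 2)

y = np.array([[0.0], [1.0], [1.0], [0.0]])       # desired XOR outputs
output = np.array([[0.1], [0.8], [0.9], [0.3]])  # made-up network outputs

cost = squared_error(y, output)
# Per-example errors: 0.01 + 0.04 + 0.01 + 0.09, so the cost is about 0.15.
print(cost)
```

The cost is zero only when every output matches its target exactly, and it grows quickly as outputs drift away from the targets.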

The last part, “updates”, defines how we want to change our network on each update step. We do this by trying to minimize the cost function with respect to the weights. This can be done with gradient descent; we calculate the gradient of the cost function with respect to the weights, and change each weight by a small step (scaled by the learning rate) in the direction that causes the cost function to go down. Theano does this for us, using the “grad” function.
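To see what Theano's `theano.tensor.grad` saves us from, here is a hand-derived version of the gradient and one update step in NumPy. This is a sketch under assumptions (summed squared error, no bias terms, a hidden width of 5, an arbitrary seed and learning rate), and `cost_and_grads` is a name made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(X, y, W1, W2):
    """Return the summed squared error and its gradients w.r.t. W1 and W2."""
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)
    cost = np.sum((y - output) ** 2)
    # Chain rule, layer by layer; the sigmoid's derivative is s * (1 - s).
    d_output = 2.0 * (output - y) * output * (1.0 - output)
    grad_W2 = hidden.T @ d_output
    d_hidden = (d_output @ W2.T) * hidden * (1.0 - hidden)
    grad_W1 = X.T @ d_hidden
    return cost, grad_W1, grad_W2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.RandomState(1234)          # arbitrary seed
W1 = rng.uniform(-1.0, 1.0, size=(2, 5))   # 5 hidden units, assumed
W2 = rng.uniform(-1.0, 1.0, size=(5, 1))
LEARNING_RATE = 0.1                        # assumed value

# One gradient-descent step: move each weight against its gradient.
cost_before, grad_W1, grad_W2 = cost_and_grads(X, y, W1, W2)
W1_new = W1 - LEARNING_RATE * grad_W1
W2_new = W2 - LEARNING_RATE * grad_W2
```

With Theano, the whole body of `cost_and_grads` after the cost line disappears, since the gradients are derived automatically from the symbolic cost expression.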

As an aside, because of the way we defined our weight matrices, the first multiplication / activation function increases the dimensionality of the input vector to 3 dimensions. This is important for allowing the network to learn the XOR function. This post provides some intuition about why this is the case. For many applications, however, we are more concerned with reducing the dimensionality of our input vector.

Here, we define our “train” and “test” functions. The “train” function updates the weights according to the update rules we provided earlier, after calculating the cost function. The “test” function gives us the output of the network. We then run the network through 60000 training steps. After training, we print the output, and lo and behold, it approximates our XOR function pretty well!
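For comparison, the whole train-then-test cycle can be sketched in plain NumPy. This is not the post's Theano code, where `theano.function` with an `updates` argument does the work. Here `train_step` and `predict` stand in for “train” and “test”, module-level arrays play the role of shared variables, and the seed, hidden width, and learning rate are assumptions, so the exact numbers printed will differ from the post's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.RandomState(1234)  # arbitrary seed
N_HIDDEN = 5                       # assumed hidden-layer width
LEARNING_RATE = 0.1                # assumed value
W1 = rng.uniform(-1.0, 1.0, size=(2, N_HIDDEN))
W2 = rng.uniform(-1.0, 1.0, size=(N_HIDDEN, 1))

def predict():
    """Counterpart of "test": the network output for all four inputs."""
    return sigmoid(sigmoid(X @ W1) @ W2)

def train_step():
    """Counterpart of "train": one gradient-descent update of W1 and W2."""
    global W1, W2  # the weights are module-level state, like shared variables
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)
    d_output = 2.0 * (output - y) * output * (1.0 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1.0 - hidden)
    W2 = W2 - LEARNING_RATE * (hidden.T @ d_output)
    W1 = W1 - LEARNING_RATE * (X.T @ d_hidden)

initial_cost = np.sum((y - predict()) ** 2)
for step in range(60000):
    train_step()
final_cost = np.sum((y - predict()) ** 2)

print(initial_cost, final_cost)
print(predict().round(3))
```

Comparing the cost before and after training is a quick sanity check that the updates are moving the weights in a useful direction.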