# A Gentle Introduction to GRU Networks

## May 3, 2020

A recurrent neural network (RNN) is an extension of the conventional feed-forward neural network capable of accommodating variable-length input sequences. In recent years, RNNs have produced increasingly promising results in numerous machine learning tasks, especially those featuring varying input and output lengths such as machine translation. RNNs accomplish such results by having a recurrent unit, or hidden state, whose activation at each time step $t$ is dependent on that of the previous time step $t-1$. That is, given a variable-length input sequence $x=(x_1,x_2,…,x_T)$, an RNN updates its recurrent hidden state $h_t$ by

$h_t = \phi(h_{t-1}, x_t)$

where $\phi$ is some non-linear function. Traditionally, the recurrent hidden state is updated by computing $h_t = g(Wx_t + Uh_{t-1})$, where $g$ is a smooth, bounded function; $W$ is the set of weights for the input $x_t$; and $U$ is the set of weights for the previous hidden state $h_{t-1}$. The RNN may also have an output $y=(y_1,y_2,…,y_T)$, which, again, can be of variable length.
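To make the traditional update concrete, here is a minimal sketch of $h_t = g(Wx_t + Uh_{t-1})$ in NumPy. The dimensions, the random weights, and the choice of $\tanh$ for $g$ are illustrative assumptions, not part of any particular model:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    """One vanilla RNN update: h_t = g(W x_t + U h_{t-1}), with g = tanh."""
    return np.tanh(W @ x_t + U @ h_prev)

# Toy dimensions (assumed): 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
U = rng.standard_normal((4, 4))

h = np.zeros(4)                           # initial hidden state
for x_t in rng.standard_normal((5, 3)):   # a length-5 input sequence
    h = rnn_step(x_t, h, W, U)            # h_t depends on h_{t-1} and x_t
```

Note how the same weights $W$ and $U$ are reused at every time step; only the hidden state carries information forward.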

However, RNNs are not perfect. Ultimately, their ability to capture long-term dependencies diminishes the longer the input becomes. Bengio et al. [2012, 1994] showed this to be a result of the gradients (error signals) eventually vanishing or, more rarely, exploding. The former is caused by the fact that back-propagation computes gradients using the chain rule, meaning that $n$ gradients are multiplied across $n$ layers, shrinking the signal exponentially to the point where the front layers either train very slowly or stop training altogether. The latter problem, exploding gradients, refers to the opposite and much rarer instance in which long-term components grow exponentially faster than short-term ones due to a large increase in the norm of the gradient during training.
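To get a feel for the vanishing case, consider what happens when the gradient is scaled by a per-step factor below one at every time step. The factor of 0.9 below is purely an assumed toy value:

```python
import numpy as np

# Each backprop step through a bounded non-linearity scales the gradient
# by some factor; assume a constant factor of 0.9 per step for illustration.
factors = np.full(50, 0.9)
signal = np.cumprod(factors)   # gradient magnitude after 1, 2, ..., 50 steps

# After 50 steps the signal is 0.9**50, roughly 0.005: all but vanished.
```

Fifty steps is not a long sequence, yet the error signal reaching the earliest time steps has already shrunk by over two orders of magnitude.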

As such, gradient-based optimisation methods struggle, both because of the variations in gradient magnitudes and because the effects of long-term dependencies are hidden by those of short-term dependencies. Now, if you are strictly interested in short-term dependencies, this shortcoming obviously matters little. However, in most real-world data, long-term dependencies are present. Adequately capturing such dependencies in addition to accommodating short-term dynamics will thus almost always improve accuracy. We, therefore, need to upgrade our RNNs. Fundamentally, we have two ways of doing that: we can either (1) devise a better learning algorithm, or (2) design a more sophisticated activation function. In this post, we will examine the latter approach in the form of the **G**ated **R**ecurrent **U**nit (GRU) as proposed by Cho et al. [2014].

Note that an earlier solution to the vanishing gradients problem is the so-called Long Short-Term Memory (LSTM) unit proposed by Hochreiter and Schmidhuber [1997]. If you are interested in learning more about LSTMs, please refer to this post. Generally, both GRUs and LSTMs outperform vanilla RNNs if long-term dependencies are important. However, whether you should choose a GRU or an LSTM ultimately depends on your data and model. I will discuss the advantages and disadvantages of both at the end of this post. Worth noting, too, is that we are strictly concerning ourselves with the vanishing gradients problem: neither LSTMs nor GRUs will solve the issue of exploding gradients as discussed by Bengio et al. [2012].

### Introducing the Gated Recurrent Unit

As previously noted, GRUs are designed to address the vanishing gradients problem present in standard recurrent neural networks. The underlying idea is that by creating a more sophisticated activation function consisting of an affine transformation followed by element-wise non-linearity using gating units, long-term dependencies can be captured. That was quite the mouthful, but it is simpler than it might sound! No worries. Your brain taking note of this important—and hopefully consoling—piece of information, by the way, is a little like how GRUs operate.

What the above description really says is that we enable each recurrent unit to adaptively capture dependencies across different time steps. In practice, GRUs achieve this by incorporating special gating mechanisms which modulate the flow of information inside each unit. Specifically, a GRU consists of an **update gate** and a **reset gate**. If you are familiar with LSTM units, you will note that GRUs do not feature a separate memory cell—I will get to this momentarily. If you are not familiar with LSTMs, no need to worry; I will discuss both solutions at the end of this post. In essence though, both the update and reset gate are really just vectors whose job we make it to determine which information should be passed as output. What makes them special is that we can train them to retain certain information relevant to our model for longer periods of time.

The below illustration shows a visual representation of a single GRU. You might have seen similar illustrations with multiple units chained together. This is a common way of employing GRUs, though it all very much depends on your specific network architecture and what you wish to accomplish. In any case, each individual unit functions the same. In this post, we will just focus on a single GRU.

This might look confusing at first glance, but bear with me for a minute. If we start by introducing the notations before incrementally going through the flow, we have the following:

The plus operation is literally adding two elements together. The $\sigma$ and $\tanh$ functions are both activation functions for their respective gates—I will cover these in greater detail shortly, though if you are wholly unfamiliar with either, check this tutorial. The Hadamard-product operation denotes element-wise multiplication of matrices. Additionally, you will find variables $h$, $x$, $z$, $r$, and $t$; these, too, will be covered in greater detail below. What is worth pointing out, however, is how the coloured lines represent the flow of information throughout the unit.

### 1. Update Gate $z$

All GRUs take two inputs: $h_{t-1}$ and $x_t$.

The former, $h_{t-1}$, is the hidden state, or activation, $h$ from the previous time step $t-1$. In most cases, this value is passed along from another GRU, but it may also come from another component of our network. Regardless of where it comes from though, $h_{t-1}$ is an encoding of all information prior to the current state. $x_t$, on the other hand, represents the input from the current time step $t$.

In some instances, we might very well be interested in accounting for all past information, but in most cases, everything is not necessarily going to be equally relevant at all times. What you ate last week, for instance, is perhaps not as indicative of what you will be eating tonight as what you ate yesterday. As such, we need a mechanism to determine how much past information we should pass along. This is the job of the **update gate**.

We compute the **update gate** $z$ for time step $t$ using:

$z_t = \sigma(W_zx_t + U_zh_{t-1})$

When $x_t$ is plugged into the network unit, it is multiplied by its own weight $W_z$. This also applies to the input $h_{t-1}$, which is similarly multiplied by its own weight $U_z$. The results are then added together after which a sigmoid activation function is applied to squash the result between 0 and 1. Occasionally, some information will diminish in importance following the squash, though all past information could also be passed along largely unchanged in importance. This all depends on your data and is exactly the behaviour we want. The notion of taking a linear sum between the existing state and the new state is similar to that of LSTM units. However, unlike LSTMs, GRUs have no mechanism to control the degree to which its state is exposed. Instead, a GRU will expose the whole state every step. In other words: At each time step $t$, we consider all available information.
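The computation above can be sketched in a few lines of NumPy; the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def update_gate(x_t, h_prev, W_z, U_z):
    """z_t = sigmoid(W_z x_t + U_z h_{t-1}); every entry lands in (0, 1)."""
    return sigmoid(W_z @ x_t + U_z @ h_prev)

# Toy dimensions (assumed): 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(1)
W_z = rng.standard_normal((4, 3))
U_z = rng.standard_normal((4, 4))

z_t = update_gate(rng.standard_normal(3), rng.standard_normal(4), W_z, U_z)
```

Each entry of $z_t$ lies strictly between 0 and 1, which is what later lets the gate act as a soft switch over the hidden state.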

### 2. Reset Gate $r$

The **reset gate** is the mechanism by which a GRU forgets. That is, the job of the reset gate is to determine how much past information the unit should retain. This might sound similar to the update gate, but there is a distinct difference: The update gate determines what of both old and current information to pass along—the reset gate determines how much old information to forget using the new information $x_t$.

We compute the **reset gate** $r$ for time step $t$ using:

$r_t = \sigma(W_rx_t + U_rh_{t-1})$

Similar to the update gate, when $x_t$ is plugged into the network unit, it is multiplied by its own weight $W_r$, and $h_{t-1}$ is multiplied by its own weight $U_r$. Once done, the results are added together and a sigmoid activation function is applied accordingly. The only difference compared to the update gate is thus the weights and how the output of the gate is used.

### 3. Candidate Activation

At this point, having passed $h_{t-1}$ and $x_t$ through both the update and reset gates, we are ready to compute our candidate activation $h'_t$ using the current memory content. Note that this computation is similar to that of the traditional recurrent unit. Additionally, where the two previous steps relied only on $h_{t-1}$ and $x_t$, this third step will also incorporate the output of the reset gate. As such, when implementing GRUs, we are ultimately using three copies of both $h_{t-1}$ and $x_t$ for every time step $t$.

We compute the **candidate activation** $h'_t$ for time step $t$ using:

$h'_t = \tanh(Wx_t+r_t \circledcirc Uh_{t-1})$

Specifically, we start by multiplying the input $x_t$ with weight $W$ and $h_{t-1}$ with weight $U$. We then compute the Hadamard—that is, element-wise—product between the reset gate $r_t$ and $Uh_{t-1}$. This step determines what should be forgotten from the previous steps. In machine translation tasks, for instance, towards the end of a sentence, most values in the $r_t$ vector are likely close to 0. We sum the results from both calculations and apply the $\tanh$ activation function. Note that if $r_t$ features low values, the reset gate effectively makes the unit act as though it is reading the first symbol of the input sequence. It is this mechanism which allows it to forget the previously computed state(s). The result of the $\tanh$ activation function is that all values are squashed between $-1$ and $1$. Additionally, since only zero-valued inputs are mapped to near-zero outputs, the elements the reset gate determined should be forgotten will, in effect, be forgotten here.
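The candidate activation, including the reset gate it depends on, might be sketched like this in NumPy; the dimensions and random weights are again assumed for illustration, and `*` plays the role of the Hadamard product on vectors:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def candidate_activation(x_t, h_prev, W, U, W_r, U_r):
    """h'_t = tanh(W x_t + r_t * (U h_{t-1})), with r_t the reset gate."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)        # reset gate, entries in (0, 1)
    return np.tanh(W @ x_t + r_t * (U @ h_prev))   # * is the element-wise product

# Toy dimensions (assumed): 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(1)
W, W_r = rng.standard_normal((2, 4, 3))
U, U_r = rng.standard_normal((2, 4, 4))

h_cand = candidate_activation(rng.standard_normal(3), rng.standard_normal(4),
                              W, U, W_r, U_r)
```

When the entries of `r_t` are near zero, the `U h_prev` term is suppressed and the unit effectively starts over from the current input alone.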

### 4. Activation

At the fourth and final step of our GRU, we need to compute the activation $h_t$, which is the vector containing the information describing the current state. In order to do this, we need to employ the output of the previous step alongside that of the update gate as to determine what information is relevant.

The activation $h_t$ is a linear interpolation between the previous activation, or previous hidden state, $h_{t-1}$ and the candidate activation $h'_t$. Specifically, we compute the **activation** $h_t$ for time step $t$ using:

$h_t = z_t \circledcirc h_{t-1}+(1-z_t)\circledcirc h'_t$

We start by applying element-wise multiplication to the update gate $z_t$ and $h_{t-1}$. We then apply element-wise multiplication to $(1-z_t)$ and $h'_t$. The final result is obtained by summing the two intermediate results. The computed activation, or hidden state, $h_t$ can then be passed along to another GRU, an attention layer, or something else entirely. Note that since everything operates on individual time steps $t$, all the above will happen for each individual element of the input sequence.
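Putting all four steps together, a single GRU forward step might be sketched as follows; the toy dimensions and random weights are assumptions for illustration only:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One full GRU update; * denotes the element-wise (Hadamard) product."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # 1. update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # 2. reset gate
    h_cand = np.tanh(W @ x_t + r_t * (U @ h_prev))   # 3. candidate activation
    return z_t * h_prev + (1 - z_t) * h_cand         # 4. linear interpolation

# Toy dimensions (assumed): 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(2)
W_z, W_r, W = rng.standard_normal((3, 4, 3))
U_z, U_r, U = rng.standard_normal((3, 4, 4))

h = np.zeros(4)                           # initial hidden state
for x_t in rng.standard_normal((5, 3)):   # a length-5 input sequence
    h = gru_step(x_t, h, W_z, U_z, W_r, U_r, W, U)
```

Because step 4 only interpolates between the previous state and a bounded candidate, every entry of $h_t$ stays within $(-1, 1)$ over arbitrarily long sequences.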

This is all that happens inside a GRU.

### How do GRUs and LSTMs compare?

It is important to once again accentuate how both techniques seek to solve the same problem; namely that of the vanishing gradients. This makes both techniques relevant whenever dealing with data in which long-term dependencies are present. As such, it makes sense to briefly compare both solutions to conventional RNNs before comparing them directly: Indeed, shared by both GRUs and LSTMs is the additive component of their update from $t$ to $t+1$. This additive component does not exist in vanilla RNNs. Instead, vanilla RNNs replace the activation—that is, the content of the recurrent unit—with a value computed using only the current input and the previous hidden state. GRUs and LSTMs both retain the existing content and add more, new content on top. This has two distinct advantages as it (1) enables each unit to remember the existence of specific features from the input stream; and (2) creates a shortcut of sorts, bypassing several temporal steps and effectively enabling the gradient to be back-propagated without quickly vanishing as a result of passing through multiple bounded non-linearities.

As previously mentioned though, one thing which GRUs are missing compared to LSTMs is the controlled exposure of the memory content. In an LSTM unit, the amount of memory content that is seen by other units in the network is controlled by a designated output gate. This gate does not exist in GRUs. Instead, all available information at the time is always exposed in full. This can be both an advantage and a disadvantage depending on your data.

Another difference between the two is the location of the input gate (LSTMs) and the corresponding reset gate (GRUs). The input gate of an LSTM controls how much new memory content is added to the memory cell, independently of its forget gate. This is unlike GRUs, in which the amount of new content added and the amount of previous state retained are coupled: the update gate alone balances the two.

However, in choosing whether to employ GRUs or LSTMs, there is unfortunately no clear-cut answer. In most cases, either will do just fine. In some cases, depending on your data and the rest of your network, one might outperform the other ever so slightly. This, essentially, was the conclusion reached by Chung et al. [2014], who sought to empirically evaluate GRUs and LSTMs in the context of sequence modelling. Fortunately, both PyTorch and TensorFlow offer plug-and-play implementations of both, allowing you to experiment accordingly.
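As a sketch of what that plug-and-play usage looks like in PyTorch, the example below builds one of each; the dimensions are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Toy dimensions (assumed): batches of 8 sequences, 5 time steps, 10 features.
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(8, 5, 10)

gru_out, h_n = gru(x)        # h_n: final hidden state only
lstm_out, (h, c) = lstm(x)   # an LSTM additionally returns a cell state c
```

Note the only interface difference: the LSTM returns a `(hidden, cell)` pair where the GRU returns a single hidden state, mirroring the absence of a separate memory cell discussed above.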

Finally, it is important to stress that whilst GRUs and LSTMs have not solved the problem of the vanishing gradients entirely, they have made it far more manageable. That is to say: given a long enough input, the vanishing gradients problem will eventually arise, and alternative techniques such as attention are required.