Activation functions for Deep Learning

The activation function (sometimes called the “transfer function”) decides whether a neuron’s input to the network is relevant or not. It determines how the weighted sum of a neuron’s inputs is transformed into an output. Activation functions whose output range is limited are also called “squashing functions.”

Introduction

Our body has different types of sensors that detect different kinds of information. This information constantly enters the brain, which processes it and forms thoughts. A thought may last a few seconds, or it may last forever. Some information is so strong that it gets engraved in our minds; you may remember the exact moment even years later.

Another interesting thing is that our brain is never overwhelmed by this continuous stream of information. That’s because the brain uses different levels of memory storage: low-priority information is forgotten, whereas information that demands the most attention is stored.

A similar process happens in artificial neural networks. Mathematical operations are done to determine if the input received is relevant or not. That’s where the activation function comes in.

An activation function decides whether a neuron should be activated or not.

Types of Activation Functions

There are three types of activation functions:

  • Linear activation function
  • Binary step function
  • Non-linear activation functions

Linear activation function

The linear activation function, also known as “no activation” or the “identity function,” passes the input through unchanged, so the output is directly proportional to the input.
The function is defined as:

$f(x)=x$

Its range is (−∞, +∞).
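
As a quick illustration, here is a minimal sketch that plots the identity activation in the same style as the examples later in this article (the labels are only illustrative):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # evenly spaced inputs from -10 to 10
y = x                           # identity: the output equals the input

plt.plot(x, y, color='darkturquoise')
plt.xlabel("x")
plt.ylabel("Linear(X)")

plt.title('Linear activation function')

plt.show()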

Binary step activation function

The binary step activation function depends on a threshold value that determines whether a neuron should be activated or not. It compares the input value with the threshold: if the input is greater than or equal to the threshold, the neuron is activated; otherwise it is deactivated, meaning its output is not passed on to the next layer.
The function is defined as:

$f(x)=\left\{\begin{array}{ll}0 & \text { for } x<0 \\ 1 & \text { for } x \geq 0\end{array}\right\}$

The binary step activation function cannot be used for multi-class classification problems. It can only be used to build binary classifiers.
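
For completeness, a minimal sketch of the binary step function, plotted in the same style as the later examples (the labels are only illustrative):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # evenly spaced inputs from -10 to 10
y = np.where(x >= 0, 1, 0)      # 1 for x >= 0, otherwise 0

plt.plot(x, y, color='darkturquoise')
plt.xlabel("x")
plt.ylabel("Step(X)")

plt.title('Binary step activation function')

plt.show()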

Linear and binary step activation functions are not typically used in deep neural networks. The most common way to train a deep neural network is gradient descent with backpropagation, and for the backpropagation algorithm to work, the activation function should be differentiable and provide a useful gradient.

  • The derivative of a linear activation function is a constant, so the gradient carries no information about the input. Moreover, if only linear activations are used, all layers of the network collapse into a single linear layer (see the sketch after this list).
  • The step function is not differentiable at x = 0 and has a zero derivative everywhere else, so backpropagation cannot propagate any useful gradient through it.
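
The layer-collapse point can be checked numerically. The sketch below (with arbitrary random weights and biases omitted for brevity) stacks two purely linear layers and shows that the result equals a single linear layer whose weight matrix is the product of the two:

import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with linear (identity) activation; biases omitted for brevity
W1 = rng.normal(size=(4, 3))   # first layer weights
W2 = rng.normal(size=(2, 4))   # second layer weights

x = rng.normal(size=(3,))      # an arbitrary input vector

# Passing x through both layers...
two_layer_output = W2 @ (W1 @ x)

# ...is exactly the same as one layer with weight matrix W2 @ W1
collapsed_output = (W2 @ W1) @ x

print(np.allclose(two_layer_output, collapsed_output))  # True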

Non-linear activation function

Most of the activation functions we use in a neural network are non-linear. Some of the non-linear activation functions used are:

1. Sigmoid function

The sigmoid function takes any real-valued input and produces output in the range of 0 to 1. If the input value is large (more positive), the output will be closer to 1, whereas if the input value is small (more negative), the output will be closer to 0.
The sigmoid function is also called the logistic function. It has a characteristic S-shaped curve.
The sigmoid function is defined as:

$f(x)=\frac{1}{1+e^{-x}}$

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # evenly spaced inputs from -10 to 10
z = 1 / (1 + np.exp(-x))        # sigmoid: 1 / (1 + e^(-x))

plt.plot(x, z, color='darkturquoise')
plt.xlabel("x")
plt.ylabel("Sigmoid(X)")

plt.title('Sigmoid activation function')

plt.show()
Figure: Sigmoid activation function

If you want the output to be interpreted as a probability, the sigmoid function is a good choice, since its output lies in the range (0, 1).
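
As a small, made-up example of this probability interpretation, a sigmoid output can be thresholded at 0.5 to turn it into a binary prediction (the names and numbers here are only illustrative):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Pre-activation values ("logits") for three hypothetical examples
logits = np.array([-2.0, 0.3, 4.0])

probs = sigmoid(logits)              # interpreted as P(class = 1)
preds = (probs >= 0.5).astype(int)   # threshold at 0.5

print(probs)   # approximately [0.119 0.574 0.982]
print(preds)   # [0 1 1]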

2. Hyperbolic Tangent function (Tanh)

The hyperbolic tangent activation function, also called the tanh function, is similar to the sigmoid function. It takes any real-valued input and produces output in the range of -1 to 1. If the input value is large (more positive), the output will be closer to 1, whereas if the input value is small (more negative), the output will be closer to -1.
The Tanh function is defined as:

$f(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # evenly spaced inputs from -10 to 10
y = np.tanh(x)                  # hyperbolic tangent

plt.plot(x, y, color='darkturquoise')
plt.xlabel("x")
plt.ylabel("Tanh(X)")

plt.title('Tanh')

plt.show()
Figure: Tanh activation function

Unlike the sigmoid, the output of tanh is centered around 0: strongly negative inputs map close to -1, inputs near zero map close to 0, and strongly positive inputs map close to 1.
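
A few sample values make this zero-centered behaviour concrete (rounded output shown in the comment):

import numpy as np

inputs = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(np.round(np.tanh(inputs), 3))   # approximately [-1. -0.462 0. 0.462 1.]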

3. ReLU (Rectified Linear unit)

ReLU is the most common activation function used in hidden layers. It mitigates the vanishing gradient problem, which is a limitation of the sigmoid and tanh functions. If the input is negative, ReLU returns 0; for any positive input, it returns that value itself.
The ReLU function is defined as:

$f(x)= max(0, x)$

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # evenly spaced inputs from -10 to 10
y = np.maximum(0, x)            # ReLU: max(0, x)

plt.plot(x, y, color='darkturquoise')
plt.xlabel("x")
plt.ylabel("ReLU(X)")

plt.title('ReLU')

plt.show()
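
The vanishing-gradient point can be made concrete with a small numerical sketch: the sigmoid derivative σ(x)(1 − σ(x)) shrinks towards 0 as the input grows, while the ReLU derivative stays at 1 for every positive input (the helper names below are only illustrative):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)                # derivative of the sigmoid

def relu_grad(x):
    return (x > 0).astype(float)      # 1 for x > 0, 0 otherwise

xs = np.array([0.0, 2.0, 5.0, 10.0])
print(np.round(sigmoid_grad(xs), 5))  # [0.25    0.10499 0.00665 0.00005]
print(relu_grad(xs))                  # [0. 1. 1. 1.]
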
4. Leaky ReLU

Leaky ReLU is a variant and improved version of ReLU. Instead of outputting 0 for negative inputs, a leaky ReLU uses a small, non-zero constant slope α (usually α = 0.01) for negative values.
Leaky ReLU addresses the “dying ReLU” problem. ReLU only passes through input values greater than zero, so what happens when the majority of inputs are negative? ReLU outputs 0 and many neurons in the network are never activated. No gradient flows through them, so those neurons stop learning, which hurts network performance.
Leaky ReLU, as the name suggests, adds a small leak for negative values rather than setting them to 0: negative inputs are multiplied by the small constant α.

The Leaky ReLU function is defined as:

$f(x)=\left\{\begin{array}{ll} x, & x>0 \\ \alpha x, & x \leq 0 \end{array}\right\}$

import matplotlib.pyplot as plt
import numpy as np

# alpha = 0.05 is used here (larger than the typical 0.01) so the
# leak on the negative side is clearly visible in the plot
alpha = 0.05

def leaky_ReLU(x):
    # max(alpha * value, value) gives value for positive inputs
    # and alpha * value for negative inputs
    data = [max(alpha * value, value) for value in x]
    return np.array(data, dtype=float)

x = np.linspace(-10, 10, 100)
y = leaky_ReLU(x)

plt.plot(x, y, color='darkturquoise')
plt.xlabel("x")
plt.ylabel("Leaky ReLU(X)")

plt.title('Leaky ReLU')

plt.grid()
plt.show()
Figure: Leaky ReLU activation function
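
To make the dying-ReLU contrast concrete, here is a minimal sketch comparing the gradients of ReLU and leaky ReLU on negative inputs (the helper names and the value of α are only illustrative):

import numpy as np

alpha = 0.01

def relu_grad(x):
    return (x > 0).astype(float)          # 0 for every negative input

def leaky_relu_grad(x):
    return np.where(x > 0, 1.0, alpha)    # small non-zero slope instead of 0

neg_inputs = np.array([-3.0, -1.0, -0.1])
print(relu_grad(neg_inputs))        # [0. 0. 0.]       -> no gradient, the neuron can "die"
print(leaky_relu_grad(neg_inputs))  # [0.01 0.01 0.01] -> a gradient still flows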

Role of activation functions in a neural network

The role of activation functions in a neural network is to introduce non-linearity.
Real-world problems are rarely linear, so a network needs non-linear functions in order to approximate non-linear relationships. In addition to introducing non-linearity, every activation function has its own set of characteristics, as discussed above.
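
As a tiny illustration of what non-linearity buys (a sketch with hand-picked weights), a single hidden layer of two ReLU units can represent the absolute-value function, something no purely linear model can do:

import numpy as np

def relu(x):
    return np.maximum(0, x)

# Hidden layer with two units and hand-picked weights:
# h1 = relu(x), h2 = relu(-x); output = h1 + h2 = |x|
x = np.array([-3.0, -0.5, 0.0, 2.0])
output = relu(x) + relu(-x)

print(output)      # [3.  0.5 0.  2. ]
print(np.abs(x))   # same values: the ReLU network reproduces |x|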
