Automatic differentiation

Author

Marcin Płodzień

1 Calculating gradients

We can distinguish three methods for calculating (partial) derivatives of a function \(f: \mathbb{R}^m \to \mathbb{R}\) in a computer program.

  1. Finite differences method:

The simplest approach is the finite-difference method. For a given function \(f\), the components of its gradient

\[\begin{equation} \nabla f = \big(\frac{\partial f}{\partial x_1},\cdots,\frac{\partial f}{\partial x_m}\big), \end{equation}\] can be approximated as \[\begin{equation} \frac{\partial f(\vec{x})}{\partial x_i} \approx \frac{ f(\vec{x} + h\vec{e}_i) - f(\vec{x})}{h}, \end{equation}\] where \(\vec{e}_i \in \mathbb{R}^m\) is the \(i\)-th unit vector and \(h\) is a small step size. However, approximating \(\nabla f\) requires \(\mathcal{O}(m)\) evaluations of \(f\), and round-off errors due to floating-point arithmetic dominate as \(h\to 0\) (see the short sketch after this list).

  2. Symbolic differentiation

Here, derivatives are calculated by automated manipulation of mathematical expressions, yielding explicit results (with the help of computer algebra systems such as Mathematica, Maple, or Maxima). For example, for two given functions \(f(x)\) and \(g(x)\), the derivative of their product is calculated explicitly as

\[\begin{equation} \frac{d}{dx}(f(x)g(x)) = \frac{df(x)}{dx}g(x) + f(x)\frac{dg(x)}{dx}. \end{equation}\]

The benefit of symbolic expressions is that the results are interpretable and can lead to analytical solutions. However, derivatives generated through symbolic differentiation typically do not allow for efficient numerical evaluation of derivative values.

  3. Automatic differentiation (AD) allows one to obtain the exact numerical value of the derivatives without the need for a symbolic expression - this method lies between symbolic differentiation and numerical differentiation.

The logic behind AD is that all numerical computations are compositions of a finite set of elementary operations whose derivatives are known. By combining the derivatives of the constituent operations through the computational graph and applying the chain rule, we can obtain the exact numerical value of the derivative of the overall expression for a given input.
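As a quick illustration of the finite-difference approach mentioned above, here is a minimal sketch; the test function and the step size \(h\) are arbitrary illustrative choices, not taken from the text.

import numpy as np

def grad_fd(f, x, h=1e-5):
    # Forward-difference approximation of the gradient of f at x;
    # requires one extra evaluation of f per coordinate.
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    fx = f(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

# Arbitrary test function f: R^2 -> R
f_test = lambda x: x[0]**2 + np.sin(x[1])

print(grad_fd(f_test, [2.0, 1.0]))   # approx [4.0, cos(1) ~ 0.5403]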

2 Computational graph

A computational graph is a graph-based representation of a mathematical computation. It is a way of visually representing the operations performed in a computation, and the dependencies between these operations.

2.1 Forward function evaluation

Let us consider the expression \(f(x_1,x_2) = \ln x_1 + \cos x_2 - x_1 x_2\). The function \(f\) is a mapping \(f: \mathbb{R}^n \to \mathbb{R}^m\) with \(n = 2\), \(m=1\).

Following [1], we introduce the notation for the computational graph as follows:

  1. Input variables are denoted as \(v_{i-n} = x_i\), where \(i = 1,\dots,n\).

  2. Intermediate variables are denoted as \(v_i\), \(i = 1,\dots,l\).

  3. Output variables are denoted as \(v_{l+i}\), \(i = 1,\dots,m\).

The computational graph for the considered function \(f(x_1,x_2)\) is shown below:

computational-graph.png
Exercise

Calculate the value of \(f(x_1,x_2)\) at \((x_1, x_2) = (2,1)\) by passing through the diagram from left to right: forward-pass.png
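For reference, here is a minimal sketch of such a forward pass in plain Python. The decomposition into primitives used below (\(v_1 = \ln v_{-1}\), \(v_2 = v_{-1} v_0\), \(v_3 = \cos v_0\), \(v_4 = v_1 - v_2\), \(v_5 = v_4 + v_3\)) is one consistent choice matching the parent table used later; the labels in the figure may differ.

import math

x1, x2 = 2.0, 1.0

v_m1 = x1                # v_{-1} = x1
v_0  = x2                # v_0    = x2
v_1  = math.log(v_m1)    # ln x1
v_2  = v_m1 * v_0        # x1 * x2
v_3  = math.cos(v_0)     # cos x2
v_4  = v_1 - v_2         # ln x1 - x1*x2
v_5  = v_4 + v_3         # f(x1, x2) = ln x1 + cos x2 - x1*x2

print(v_5)               # approx -0.7666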

2.2 Calculating gradients

Automatic differentiation allows us to calculate the exact value of the gradient at a given point. In our example, we are interested in the value of \(\frac{\partial f}{\partial x_1}\) at the point \((x_1, x_2) = (2,1)\). This can be obtained in one of two modes.

2.2.1 Forward-mode AD

Forward-mode AD is implemented by complementing each intermediate variable \(v_i\) with its derivative: \[\begin{equation} \dot{v}_i = \frac{\partial v_i}{\partial x_1}, \end{equation}\] and by applying the chain rule we obtain the desired gradient. Derivatives are propagated forward in sync with the function evaluation.

Exercise

Calculate the value of \(\frac{\partial f}{\partial x_1} = \dot{v}_5\) at \((x_1, x_2) = (2,1)\) by passing through the diagram: forward-mode-AD.png

Dual numbers and forward-mode AD

In practice, forward mode is implemented by extending the algebra of real numbers via \(\textit{dual}\) numbers, defined as \[\begin{equation} z_1 = a_1 +\epsilon b_1, \end{equation}\] where \(a_1, b_1 \in \mathbb{R}\), and \(\epsilon \neq 0\) with \(\epsilon^2 = 0\). Addition and multiplication of dual numbers are defined as:

  1. Addition: \(z_1 + z_2 = (a_1 + a_2) + \epsilon(b_1 + b_2)\)

  2. Multiplication: \(z_1z_2 = a_1a_2 + a_1b_2\epsilon + b_1a_2\epsilon + b_1b_2\epsilon^2 = a_1a_2 + \epsilon(a_1b_2+a_2b_1)\)

Next, considering the Taylor series expansion in \(\epsilon\), we have

\[\begin{equation} f(a+\epsilon) = f(a) + f'(a)\epsilon + \frac{1}{2}f''(a)\epsilon^2 + \dots, \end{equation}\] which, since \(\epsilon^2 = 0\), simplifies to \[\begin{equation} f(a+\epsilon) = f(a) + \epsilon f'(a). \end{equation}\] This means that evaluating \(f\) on the dual number \(a + \epsilon\) automatically provides both the numerical value \(f(a)\) and the derivative \(f'(a)\).

In a numerical implementation, dual numbers are handled via operator overloading, so that all mathematical operators act appropriately on the new algebra of dual numbers.
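As a minimal sketch (not a full implementation), here is how a dual-number type with overloaded operators can compute \(\partial f/\partial x_1\) for the running example; only the operations needed here are defined.

import math

class Dual:
    # Dual number a + eps*b with eps**2 = 0; b carries the derivative.
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__
    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a - other.a, self.b - other.b)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def log(z):  return Dual(math.log(z.a), z.b / z.a)
def cos(z):  return Dual(math.cos(z.a), -math.sin(z.a) * z.b)

# f(x1, x2) = ln(x1) + cos(x2) - x1*x2, evaluated at (2, 1);
# seeding x1 with derivative 1 gives df/dx1 alongside the value.
x1 = Dual(2.0, 1.0)   # dot{x1} = 1
x2 = Dual(1.0, 0.0)   # dot{x2} = 0
f = log(x1) + cos(x2) - x1 * x2
print(f.a, f.b)       # value approx -0.7666, df/dx1 = 1/2 - 1 = -0.5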

2.2.2 Reverse-mode (backpropagation) AD

In reverse mode we calculate gradients backwards. Let’s have a look at our computational graph one more time:

computational-graph.png

We are interested in calculating the derivative of the output \(y_j\) with respect to \(v_i\), i.e. \(\frac{\partial y_j}{\partial v_i}\). For a computational graph we can write the chain rule as \[\begin{equation} \frac{\partial y_j}{\partial v_i} = \frac{\partial y_j}{\partial v_k}\frac{\partial v_k}{\partial v_i}, \end{equation}\] where \(v_k\) is the parent of \(v_i\) in the computational graph. When \(v_i\) has more than one parent we sum up the chain rule contributions: \[\begin{equation} \frac{\partial y_j}{\partial v_i} = \sum_{p\in \text{parents}(i)} \frac{\partial y_j}{\partial v_p}\frac{\partial v_p}{\partial v_i}. \end{equation}\] In the literature this quantity is called the \(\textit{adjoint}\) and is denoted \[\begin{equation} \bar{v}_i = \frac{\partial y_j}{\partial v_i}. \end{equation}\]

Next, we can rewrite the adjoint in terms of the adjoints of the parents, i.e. \[\begin{equation} \bar{v}_i = \sum_{p\in \text{parents}(i)} \bar{v}_p \frac{\partial v_p}{\partial v_i}, \end{equation}\] which gives us a recursive algorithm starting from the output node \(y\) with \(\bar{y} = 1\).

Let’s write the parents of each node in our example: \[\begin{equation} \begin{split} \text{parents}(i=5) &\to y \\ \text{parents}(i=4) &\to \{v_5\} \\ \text{parents}(i=3) &\to \{v_5\} \\ \text{parents}(i=2) &\to \{v_4\} \\ \text{parents}(i=1) &\to \{v_4\} \\ \text{parents}(i=0) &\to \{v_2, v_3\} \\ \text{parents}(i=-1) &\to \{v_1, v_2\}\\ \end{split} \end{equation}\]

Now we can write the adjoints: \[\begin{equation} \begin{split} \bar{v}_5 & = \bar{y} \\ \bar{v}_4 & = \bar{v}_5\frac{\partial v_5}{\partial v_4}\\ \bar{v}_3 & = \bar{v}_5\frac{\partial v_5}{\partial v_3}\\ \bar{v}_2 & = \bar{v}_4\frac{\partial v_4}{\partial v_2}\\ \bar{v}_1 & = \bar{v}_4\frac{\partial v_4}{\partial v_1}\\ \bar{v}_0 & = \bar{v}_2\frac{\partial v_2}{\partial v_0} + \bar{v}_3\frac{\partial v_3}{\partial v_0} \\ \bar{v}_{-1} & = \bar{v}_1\frac{\partial v_1}{\partial v_{-1}} + \bar{v}_2\frac{\partial v_2}{\partial v_{-1}} \\ \end{split} \end{equation}\]

Finally, we notice that \[\begin{equation} \begin{split} \bar{v}_0 & = \bar{x}_2 = \frac{\partial y}{\partial x_2}\\ \bar{v}_{-1} & = \bar{x}_1 = \frac{\partial y}{\partial x_1}. \end{split} \end{equation}\]

In other words, a single backward pass gives us both \(\frac{\partial y}{\partial x_1}\) and \(\frac{\partial y}{\partial x_2}\) (while in forward mode one pass yields only \(\frac{\partial y}{\partial x_1}\)).
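Here is a minimal sketch of this recursion in plain Python, assuming the same decomposition as in the forward-pass sketch above; the local derivatives are written explicitly in the comments.

import math

# Forward pass (same decomposition as before) at (x1, x2) = (2, 1)
x1, x2 = 2.0, 1.0
v_m1, v_0 = x1, x2
v_1 = math.log(v_m1)
v_2 = v_m1 * v_0
v_3 = math.cos(v_0)
v_4 = v_1 - v_2
v_5 = v_4 + v_3          # y = f(x1, x2)

# Backward pass: propagate adjoints from the output to the inputs
vb_5 = 1.0                                      # bar{y} = 1
vb_4 = vb_5 * 1.0                               # dv5/dv4 = 1
vb_3 = vb_5 * 1.0                               # dv5/dv3 = 1
vb_2 = vb_4 * (-1.0)                            # dv4/dv2 = -1
vb_1 = vb_4 * 1.0                               # dv4/dv1 = 1
vb_0 = vb_2 * v_m1 + vb_3 * (-math.sin(v_0))    # dv2/dv0 = v_{-1}, dv3/dv0 = -sin v_0
vb_m1 = vb_1 * (1.0 / v_m1) + vb_2 * v_0        # dv1/dv_{-1} = 1/v_{-1}, dv2/dv_{-1} = v_0

print(vb_m1)   # df/dx1 = 1/x1 - x2 = -0.5
print(vb_0)    # df/dx2 = -x1 - sin(x2) approx -2.8415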

Exercise

Calculate the value of \(\frac{\partial f}{\partial x_1} = \bar{v}_{-1}\) at \((x_1, x_2) = (2,1)\) by passing through the diagram: backward-pass-AD.png

How to choose between forward-mode and reverse-mode?

Let’s consider a function \(f:\mathbb{R}^m \to \mathbb{R}^n\):

  1. If \(m \ll n\), i.e. the number of inputs is much smaller than the number of outputs, it is computationally more favorable to use forward-mode automatic differentiation.

  2. If \(m \gg n\), i.e. the number of inputs is much larger than the number of outputs (which is the case for neural networks), it is computationally more favorable to use reverse-mode automatic differentiation (see the sketch below).
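In PyTorch the two modes roughly correspond to Jacobian-vector products (forward) and vector-Jacobian products (reverse). Below is a minimal sketch using torch.autograd.functional; the toy function, input point, and directions are arbitrary illustrative choices.

import torch
from torch.autograd.functional import jvp, vjp

# Toy map f: R^3 -> R^2
def f(x):
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

x = torch.tensor([1.0, 2.0, 3.0])

# Forward mode: one Jacobian-vector product J v per input direction v
_, Jv = jvp(f, x, torch.tensor([1.0, 0.0, 0.0]))

# Reverse mode: one vector-Jacobian product u^T J per output direction u
_, uJ = vjp(f, x, torch.tensor([1.0, 0.0]))

print(Jv)   # column of J: [x[1], 0] = [2., 0.]
print(uJ)   # row of J:    [x[1], x[0], 0] = [2., 1., 0.]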

2.3 Calculating gradients with PyTorch: torch.autograd

torch.autograd is PyTorch’s automatic differentiation engine that helps in neural network training.

2.3.1 Example 1

To compute the gradient of a scalar function \(f\) with respect to a single variable \(x\), we can use PyTorch’s autograd module. For example:

import torch

# Create a tensor with requires_grad set to True
x = torch.tensor([4.0], requires_grad=True)

# Define a scalar function f
def f(x):
    return x ** 2

# Compute the gradient of f with respect to x
y = f(x)
y.backward()

# The gradient of f with respect to x is stored in x.grad
print(x.grad)
tensor([8.])

2.3.2 Example 2

To compute the gradient of a function with respect to multiple variables, we can pass tensors with requires_grad set to True to the function and then call the backward method on the resulting tensor.

import torch

# Create tensors with requires_grad set to True
x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([1.0], requires_grad=True)
z = torch.tensor([1.0], requires_grad=True)
# Define a function of three variables
def f(x, y, z):
    return torch.log(torch.sin(x) + torch.tanh(y**2))/z

# Compute the gradient of f with respect to x, y, and z
g = f(x, y, z)
g.backward()

# The gradients are stored in x.grad, y.grad, and z.grad
print(x.grad)   
print(y.grad)   
print(z.grad)
tensor([0.3370])
tensor([0.5240])
tensor([-0.4719])

2.3.3 Example 3

Automatic differentiation can be used in more complicated problems.

Let us consider the eigenproblem for a double-well potential, modeled as a quantum harmonic oscillator with a barrier of Gaussian profile:

\[\begin{equation} \begin{split} \hat{H}\psi_n(x) & = E_n\psi_n(x) \\ \hat{H} & = -\frac{1}{2}\frac{\partial^2}{\partial x^2} + \frac{1}{2}x^2 + \kappa e^{-x^2/2} \end{split} \end{equation}\]

We are interested in how the ground-state energy \(E_{0}\) changes as a function of the parameter \(\kappa\). To tackle this problem, we first have to calculate the eigenstates of the considered Hamiltonian.

Let us consider an equally spaced set of points \(x_i\) lying in the interval \([-L/2, L/2]\) with spacing \(\Delta x = \frac{L}{N_x}\), where \(N_x\) is the number of discretization points.

First, we have to solve the eigenproblem of the form \[\begin{equation} \bar{H}\vec{\psi}_n = E_n \vec{\psi}_n, \end{equation}\] where \(\bar{H}\) is the \(N_x\times N_x\) matrix representation of the Hamiltonian \(\hat{H}\) in the discretized space. After constructing the matrix \(\bar{H}\), we can diagonalize it and find the eigenvectors \(\vec{\psi}_n\) and corresponding eigenvalues \(E_n\). Note that \(n\in \{0,1,\dots,N_x-1\}\).

Discrete second-order derivative and Hamiltonian matrix representation

To construct the matrix representation of \(\hat{H}\), we have to implement the discrete second-order derivative: \[\begin{equation} \frac{\partial^2 \psi(x)}{\partial x^2}\Big|_{x_i} \approx \frac{\psi_{i+1} - 2\psi_i + \psi_{i-1}}{\Delta x^2}, \end{equation}\] where \(\psi_i = \psi(x_i)\).

The matrix representation of the kinetic part of the Hamiltonian \(\hat{H}\) is

\[\begin{equation} \bar{H}_{\text{T}} = -\frac{1}{2}\frac{1}{\Delta x^2}\begin{bmatrix} -2 & 1 & 0 & 0 & 0 & \cdots \\ 1 & -2 & 1 & 0 & 0 & \cdots \\ 0 & 1 & -2 & 1 & 0 & \cdots \\ 0 & 0 & 1 & -2 & 1 & \cdots \\ 0 & 0 & 0 & 1 & -2 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \end{equation}\]

The matrix representation of the potential part of the Hamiltonian \(\hat{H}\) is a diagonal matrix with elements \([\bar{H}_\text{V}]_{i,j} = \delta_{i,j} \big(\frac{1}{2}x_i^2 + \kappa e^{-x_i^2/2} \big)\)

Now, the Hamiltonian matrix representation is simply \(\bar{H} = \bar{H}_T + \bar{H}_V\).

Let’s import the necessary libraries:

import matplotlib.pyplot as plt
import torch
import numpy as np
from torch.linalg import eigh

Let’s define a function returning the matrix representation of the considered Hamiltonian. First, let’s define the system size \(L = 10\), \(N_x = 100\) discretization points, a function returning the external potential, and a function returning the matrix representation of the Hamiltonian:

L = 10
Nx = 100
x = torch.linspace(-L/2,L/2,Nx)
dx = x[1]-x[0]



def get_potential(x, kappa):
    # Harmonic trap plus a Gaussian barrier of height kappa
    return 0.5*x**2 + kappa*torch.exp(-x**2/2)

def get_H(kappa):
    # Kinetic part: finite-difference Laplacian (tridiagonal matrix)
    ones = torch.ones(Nx)
    H_T = -0.5/dx**2*( torch.diag_embed(ones[:-1], offset = 1) - 2*torch.diag_embed(ones, offset = 0) + torch.diag_embed(ones[:-1], offset = -1))
    # Potential part: diagonal matrix
    H_V = torch.diag(get_potential(x, kappa))
    H = H_T + H_V
    return H

Let’s have a look at the shape of the potential for different values of the parameter \(\kappa\):

kappa_vec = torch.tensor([0, 2, 4, 10])
 
N_kappa = kappa_vec.size(0)
 
fig, ax = plt.subplots(1, N_kappa, figsize=(16,4))
FontSize = 16
for kappa_i in range(0, N_kappa):
    kappa = kappa_vec[kappa_i]
    V = get_potential(x,kappa)
    ax[kappa_i].plot(x,V.detach().numpy())
    ax[kappa_i].set_title(r"$\kappa = $ " + "{:2.2f}".format(kappa.item()))
    ax[kappa_i].set_xlabel("x",fontsize=FontSize)
ax[0].set_ylabel(r"$V(x)$",fontsize=FontSize)
plt.show()

The PyTorch function torch.linalg.eigh computes the eigenvalues and eigenvectors of a symmetric (Hermitian) matrix. This allows us to write two functions returning the ground-state energy \(E_{\text{GS}}\) and the energy gap \(\Delta E\), i.e. the difference between the two lowest eigenenergies:

def get_energy_ground_state(kappa):
    H = get_H(kappa)
    Energies, Vectors = eigh(H)
    E_GS = Energies[0]    
    return E_GS

def get_gap(kappa):
    H = get_H(kappa)
    Energies, Vectors = eigh(H)
    gap = Energies[1] - Energies[0]    
    return gap

Let’s look at the densities of the first few eigenstates of the Hamiltonian for a given parameter \(\kappa\):

fig, ax = plt.subplots(3, N_kappa, figsize=(16,12))
FontSize = 16
for kappa_i in range(0, N_kappa):
    kappa = kappa_vec[kappa_i]
    V = get_potential(x,kappa)
    ax[0,kappa_i].plot(x,V.detach().numpy())
    ax[0,kappa_i].set_title(r"$\kappa = $ " + "{:2.2f}".format(kappa.item()),fontsize=FontSize)
    ax[0,kappa_i].set_xlabel("x",fontsize=FontSize)

    H = get_H(kappa)
    Energies, Vectors = eigh(H)
    rho = torch.abs(Vectors)**2/dx
    n_max =  4            # maximal number of eigenstates
    E_max = Energies[n_max].item()  # upper bound for the energy-axis ticks
 
    for i in range(0,n_max):
        ax[1,kappa_i].plot(x,rho[:,i].detach().numpy())
    ax[2,kappa_i].plot(Energies[0:n_max].detach().numpy(),'x')
   
    ax[1,kappa_i].set_xlabel("x",fontsize=FontSize)

    ax[2,kappa_i].set_xlabel(r"n",fontsize=FontSize)
    ax[2,kappa_i].set_yticks(np.arange(0,E_max,0.5))
    
    
ax[0,0].set_ylabel(r"$V(x)$",fontsize=FontSize) 
ax[1,0].set_ylabel(r"$|\psi_n(x)|^2$",fontsize=FontSize) 
ax[2,0].set_ylabel(r"$E_n$",fontsize=FontSize) 
plt.show()

Now, let’s check how the ground-state energy and the gap change with \(\kappa\):

E_gs_vs_kappa = []
gap_vs_kappa = []
kappa_max = 11
kappa_vec = np.linspace(0,kappa_max,100)
for kappa in kappa_vec:
    E_gs_vs_kappa.append(get_energy_ground_state(kappa).item())
    gap_vs_kappa.append(get_gap(kappa).item())
    
FontSize = 16
fig, ax = plt.subplots(1,2,figsize=(10,4))
ax[0].plot(kappa_vec, E_gs_vs_kappa, '--')
ax[1].plot(kappa_vec, gap_vs_kappa, '--')
ax[0].set_xlabel(r"$\kappa$",fontsize=FontSize)
ax[0].set_ylabel(r"ground state energy $E_{GS}$",fontsize=FontSize)
ax[1].set_xlabel(r"$\kappa$",fontsize=FontSize)
ax[1].set_ylabel(r"energy gap $\Delta{E}$", fontsize=FontSize)
plt.show()

Finally, let’s calculate the derivative \(\frac{d E_{\text{GS}}(\kappa)}{d \kappa}\) at a given \(\kappa\) using torch.autograd:

kappa_fixed = torch.tensor(10., requires_grad = True)
E_GS = get_energy_ground_state(kappa_fixed)
E_GS.backward()
print(kappa_fixed.grad)
tensor(0.1249)
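As a sanity check, we can compare the autograd result with a simple central finite-difference estimate (the step size below is an arbitrary choice):

# Central finite-difference check of dE_GS/dkappa at kappa = 10
h = 1e-3
E_plus = get_energy_ground_state(torch.tensor(10.0 + h))
E_minus = get_energy_ground_state(torch.tensor(10.0 - h))
print((E_plus - E_minus) / (2*h))   # should be close to kappa_fixed.grad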

Let’s plot \(\frac{d E_{\text{GS}}(\kappa)}{d \kappa}\) and \(\frac{d \Delta E(\kappa)}{d \kappa}\):

dEdkappa = []
dgapdkappa = []
for kappa in kappa_vec:
    kappa_fixed_1 = torch.tensor(kappa, requires_grad = True)
    E_GS = get_energy_ground_state(kappa_fixed_1)
    E_GS.backward()
    diff = kappa_fixed_1.grad
    dEdkappa.append([diff.item()])
    
    kappa_fixed_2 = torch.tensor(kappa, requires_grad = True)    
    gap = get_gap(kappa_fixed_2)
    gap.backward()
    diff = kappa_fixed_2.grad
    dgapdkappa.append([diff.item()])    

fig, ax = plt.subplots(1,2,figsize=(14,4))
ax[0].plot(kappa_vec,dEdkappa,'--')
ax[1].plot(kappa_vec,dgapdkappa,'--')
FontSize=16
ax[0].set_xlabel(r"$\kappa$",fontsize=FontSize)
ax[0].set_ylabel(r"$\frac{dE}{d\kappa}$",fontsize=FontSize)

ax[1].set_xlabel(r"$\kappa$",fontsize=FontSize)
ax[1].set_ylabel(r"$\frac{d\Delta E}{d\kappa}$",fontsize=FontSize)
plt.show()

3 Gradients in a Deep Neural Network

We can express a neural network with \(K\) layers as a many-level function composition

\[\begin{equation} \vec{y} = (f_K\circ f_{K-1} \circ \dots \circ f_1)(\vec{x}) = f_K(f_{K-1}(\cdots f_1(\vec{x}))), \end{equation}\]

where \(\vec{x}\) are the inputs, \(\vec{y}\) are the outputs (e.g. predicted labels), and \(j = 1,\dots,K\) enumerates the layers of the network. The input data flow can be decomposed into the following steps:

\[\begin{equation} \begin{split} f_0 & \equiv x \\ f_1 & = \sigma_1(W_0 f_0 + b_0) \\ f_2 & = \sigma_2(W_1 f_1 + b_1) \\ \vdots\\ f_j & = \sigma_j(W_{j-1} f_{j-1} + b_{j-1})\\ \vdots\\ f_K & = \sigma_K(W_{K-1} f_{K-1} + b_{K-1}). \end{split} \end{equation}\]

Since we’ll mostly be discussing autograd in the context of training, our output of interest will be the model’s loss function \(L\), which is a single-valued scalar function of the model’s output. This function expresses how far off our model’s prediction was from a particular input’s ideal output.

We minimize the loss by an iterative process of passing the training data through the network and updating the weights according to the gradient of the loss function with respect to the trainable parameters \(\theta = \{W_0, b_0, W_1, b_1, \dots, W_{K-1}, b_{K-1}\}\). The gradient of the loss function with respect to the parameter set \(\theta\) requires the partial derivatives with respect to \(\theta_j = \{W_j, b_j\}\) of each layer \(j = 0,\dots,K-1\):

\[\begin{equation} \begin{split} \frac{\partial L}{\partial \theta_{K-1}} & = \frac{\partial L}{\partial f_K}\color{blue}{ \frac{\partial f_K}{\partial \theta_{K-1}}} \\ \frac{\partial L}{\partial \theta_{K-2}} & = \frac{\partial L}{\partial f_K}\color{red}{ \frac{\partial f_K}{\partial f_{K-1}}}\color{blue}{ \frac{\partial f_{K-1}}{\partial \theta_{K-2}}} \\ \frac{\partial L}{\partial \theta_{K-3}} & = \frac{\partial L}{\partial f_K}\color{red}{ \frac{\partial f_K}{\partial f_{K-1}}}\color{red}{ \frac{\partial f_{K-1}}{\partial f_{K-2}}}\color{blue}{ \frac{\partial f_{K-2}}{\partial \theta_{K-3}}} \\ \frac{\partial L}{\partial \theta_{K-4}} & = \frac{\partial L}{\partial f_K}\color{red}{ \frac{\partial f_K}{\partial f_{K-1}}}\color{red}{ \frac{\partial f_{K-1}}{\partial f_{K-2}}}\color{red}{ \frac{\partial f_{K-2}}{\partial f_{K-3}}}\color{blue}{ \frac{\partial f_{K-3}}{\partial \theta_{K-4}}} \\ \vdots \end{split} \end{equation}\]

As such, assuming we have already computed \(\frac{\partial L}{\partial \theta_{i+1}}\), the intermediate Jacobian products (the red factors above) can be reused to compute \(\frac{\partial L}{\partial \theta_{i}}\) - which is exactly reverse-mode AD.
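As an illustration, here is a minimal PyTorch sketch of this reverse-mode accumulation for a small fully connected network; the layer sizes, activation, loss, and random data are arbitrary choices.

import torch
import torch.nn as nn

# Small network: K = 3 affine layers with tanh activations (arbitrary sizes)
model = nn.Sequential(
    nn.Linear(4, 8), nn.Tanh(),
    nn.Linear(8, 8), nn.Tanh(),
    nn.Linear(8, 1),
)
loss_fn = nn.MSELoss()

# Random toy data (arbitrary)
x = torch.randn(16, 4)
target = torch.randn(16, 1)

# The forward pass builds the computational graph ...
loss = loss_fn(model(x), target)

# ... and a single backward pass (reverse-mode AD) fills p.grad
# with dL/dp for every parameter W_j, b_j of every layer.
loss.backward()

for name, p in model.named_parameters():
    print(name, p.grad.shape)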

Autograd lies at the heart of building machine learning projects. It allows automatic calculation of the gradient of a custom loss function for a neural network with respect to its parameters.