Warning ⚠: This is not meant to be a rigouros explanation of automatic differentiation. These must be considered as some notes that I wrote for myself, to better understand the topic of automatic differentiation. Furthermore, these notes are far to be completed and need to be refined.

Several computational techniques (minimization algorithms, training of Deep Neural Networks, Hamiltonian MonteCarlo) requires the computation of gradients, jacobian, and hessians. While we all learnt how to compute these quantities during calculus courses, these techniques may be not well suited when dealing with numerical computations.

In the reminder of this post, I'll walk through the main techniques that can be used to compute derivatives:

Example

Let us consider the following function f(x):R3Rf(\boldsymbol{x}):\mathbb{R}^3\rightarrow\mathbb{R}

f(x1,x2,x3)=sinx1+x1x2+exp(x2+x3) f(x_1,x_2, x_3)= \sin x_1 + x_1 x_2 + \exp(x_2 + x_3)

For instance, if x0=(0,0,0)\boldsymbol{x}_0= (0,0,0), which is the value of the gradient of ff evaluated at x0\boldsymbol{x}_0?

f(x0)=(fx1,fx2,fx3)(x0)=? \nabla f(\boldsymbol{x}_0) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \frac{\partial f}{\partial x_3}\right)(\boldsymbol{x}_0) = ?

Symbolic approach

This is the classical approach that we all learnt during calculus courses: you simply have to write down the analytical derivatives and compute their values

fx1=cosx1+x2 \frac{\partial f}{\partial x_1} = \cos x_1 + x_2 fx2=x1+exp(x2+x3) \frac{\partial f}{\partial x_2} = x_1 + \exp(x_2+x_3) fx3=exp(x2+x3) \frac{\partial f}{\partial x_3} = \exp(x_2+x_3) f(x)=(cosx1+x2,x1+exp(x2+x3),exp(x2+x3))(x) \nabla f(\boldsymbol{x}) = \left(\cos x_1 + x_2, x_1 + \exp(x_2+x_3), \exp(x_2+x_3)\right)(\boldsymbol{x}) f(x0)=(cos0+0,0+exp(0+0),exp(0+0))=(1,1,1) \nabla f(\boldsymbol{x}_0) = \left(\cos 0 + 0, 0 + \exp(0+0), \exp(0+0)\right) = \left(1,1,1 \right)

Quite easy, isn't it? What are the advantages of this approach?

  • ✅ High precision, since derivatives are analytical

  • ✅ As long as you have an analytical expression for the input function, this approach works

However, this approach has several drawbacks

  • ❌ For complicated functions, the derivative may not have a maneageable analytical expression

  • ❌ We don't always have an analytical functions to differentiate. For instance, when dealing with Differential Equations, we often have only a numerical solution

Some of these problems can be alleviated with the next approach: finite difference derivatives.

Finite difference derivatives

The second approach replaces the limit in the derivative definition with a finite difference derivative:

f(x)=limϵ0f(x+ϵ)f(x)ϵf(x+Δx)f(x)Δx f^{\prime}(x)=\lim _{\epsilon \rightarrow 0} \frac{f(x+\epsilon)-f(x)}{\epsilon}\approx \frac{f(x+\Delta x)-f(x)}{\Delta x}

If we have a function with more variables, this approach need to be extended to each of the variables involved. Let us code it in Julia!

f(x) = sin(x[1])+x[1]*x[2]+exp(x[2]+x[3])

What about the efficiency of this function?

using BenchmarkTools
@benchmark f([0,0,0])

BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  29.312 ns …  2.354 μs  ┊ GC (min … max):  0.00% … 98.55%
 Time  (median):     30.913 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   36.569 ns ± 89.777 ns  ┊ GC (mean ± σ):  10.07% ±  4.05%

  ▂▅█▇▃▂▃▃▂▁▁       ▂▁                                 ▁▁ ▂   ▂
  █████████████▇▇▇▇█████▇▆▆▆▄▆▅▅▅▁▁▄▃▅▅▁▁▄▃▃▄▁▄▃▃▄▁▄▁▄▆█████▇ █
  29.3 ns      Histogram: log(frequency) by time      62.7 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.
Let us evaluate the finite difference derivative![1]

using LinearAlgebra

function  ∇f(f, x, ϵ)
    gradf = zeros(size(x))
    ϵ_matrix = ϵ * Matrix(I, length(x), length(x))
    for i in 1:length(x)
        gradf[i] = (f(x+ϵ_matrix[i,:])-f(x-ϵ_matrix[i,:]))/2ϵ
    end
    return gradf
end

gradf = ∇f(f,[0,0,0], 1e-4)

Let us see the result of the calculation!

∂f1=0.99999999833289
∂f2=1.0000000016668897
∂f3=1.0000000016668897
Nice! We have evaluated the required gradient. However, the result is imprecise: there is a truncation error! We have approximated the derivative and this gave us an imprecise result. Furthermore, the error depends on the chosen step-size. If the step-size is too big, we are not going to approximate the derivative...

gradf = ∇f(f,[0,0,0], 1e-1)

∂f1=0.9983341664682821
∂f2=1.001667500198441
∂f3=1.001667500198441
...on the other hand, a small step-size will incure on floating-precision error

gradf = ∇f(f,[0,0,0], 1e-15)

∂f1=1.0547118733938987
∂f2=1.0547118733938987
∂f3=1.0547118733938987
On the performance side,

@benchmark gradf = ∇f(f,[0,0,0], 1e-4)
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
 Range (min … max):  518.938 ns … 16.102 μs  ┊ GC (min … max):  0.00% … 95.75%
 Time  (median):     541.745 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   641.938 ns ±  1.074 μs  ┊ GC (mean ± σ):  12.34% ±  7.08%

   ▅█                                                           
  ▄██▇▄▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  519 ns          Histogram: frequency by time         1.05 μs <

 Memory estimate: 1.28 KiB, allocs estimate: 16.

Forward algorithmic differentiation

using ForwardDiff
@benchmark ForwardDiff.gradient(f, [0,0,0])
[1.0, 1.0, 1.0]
BenchmarkTools.Trial: 10000 samples with 207 evaluations.
 Range (min … max):  364.353 ns …  17.936 μs  ┊ GC (min … max): 0.00% … 97.60%
 Time  (median):     380.082 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   441.593 ns ± 727.783 ns  ┊ GC (mean ± σ):  7.46% ±  4.44%

  ▅█▆▄▃▂▃▃▂▁▁        ▁▁▁                                        ▁
  █████████████▇▇▆▇▇▇███▇▆▅▆▆▅▅▄▄▄▅▆▅▆▅▆▇▇▇▇▆▆▄▄▂▄▃▃▂▃▃▄▃▃▅▅▅▅▅ █
  364 ns        Histogram: log(frequency) by time        862 ns <

 Memory estimate: 512 bytes, allocs estimate: 5.

Backward algorithmic differentiation

In this part, I'll show a different approach to compute gradient: backward automatic differentiation. I'll write all steps and show them graphically. Disclaimer: I am going to be a bit tedious.

Forward pass

The first step requires the computation of the function. While computing the function, we define some intermediate variables wiw_i and store their values, writing down the Wengert list. Let us start from the first step.

VariableValue
w1w_100
w2w_200
w3w_300

This was an easy one: basically, we simply had to define three variables, corresponding to the three inputs. Let's move forward!

VariableValue
w1w_100
w2w_200
w3w_300
w4w_400
w5w_500
w6w_600

As we can see, we have included some real functions (sin\sin, ++, ...) and combined the previous step.

VariableValue
w1w_100
w2w_200
w3w_300
w4w_400
w5w_500
w6w_600
w7w_711

Let's just keep going on...

VariableValue
w1w_100
w2w_200
w3w_300
w4w_400
w5w_500
w6w_600
w7w_711
w8w_811

...and on...

...and here we are! We have computed the Wengert list and the output of the function. While this look trivial and useless, we have evaluated al quantities required for the next step!

Backward pass

We are now in the position to compute the gradient of the function. Let us start defining the adjoint of a quantity xx, which is mapped in another quantity yy:

xˉ=dydx \bar{x} = \frac{\mathrm{d}y}{\mathrm{d} x}

This quantity and the chain rule will be the key ingredients in the forecoming calculations. Using the chain rule, we will start from the output, yy, and we will pull back the calculations, till we will reach the beginning of the calculation. How can we use the chain rule? The gradient che be rewritten as

dydx=yw8dw8dx \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\frac{\partial y}{\partial w_8}\frac{\mathrm{d}w_8}{\mathrm{d} x}

Since we know the relation between yy and w8w_8 we can compute the partial derivative we added

dydx=dw8dx \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\frac{\mathrm{d}w_8}{\mathrm{d} x}

But now, let's get back to the graph! I'll add in green each calcuation we have been doing

Let's keep going!

dydx=(w8w4dw4dx+w8w5dw5dx+w8w7dw7dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left( \frac{\partial w_8}{\partial w_4}\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\partial w_8}{\partial w_5}\frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\partial w_8}{\partial w_7}\frac{\mathrm{d}w_7}{\mathrm{d} x} \right) dydx=(dw4dx+dw5dx+dw7dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\mathrm{d}w_7}{\mathrm{d} x} \right)

Up to now, everything has been quite easy. Let's continue with the next step!

dydx=(dw4dx+dw5dx+w7w6dw6dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\partial w_7}{\partial w_6}\frac{\mathrm{d}w_6}{\mathrm{d} x} \right) dydx=(dw4dx+dw5dx+exp(w6)dw6dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \exp (w_6)\frac{\mathrm{d}w_6}{\mathrm{d} x} \right)

Finally! Has we can see evaluating the partial derivative requires the value of w6w_6. But we already know this values, since it is stored in the Wengert list!

dydx=(dw4dx+dw5dx+dw6dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_4}{\mathrm{d} x} + \frac{\mathrm{d}w_5}{\mathrm{d} x} + \frac{\mathrm{d}w_6}{\mathrm{d} x} \right)

Now, the meaning of the first step should be clearer: we have evaluated and stored all intermediate quantities. In this way, while moving back along the computationad graph, we already have all quantities required to compute the gradient!

Let's keep going!

dydx=((w4w1dw1dx)+(w5w1dw1dx+w5w2dw2dx)+(w6w2dw2dx+w6w3dw3dx)) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(\frac{\partial w_4}{\partial w_1} \frac{\mathrm{d}w_1}{\mathrm{d} x}\right) + \left(\frac{\partial w_5}{\partial w_1} \frac{\mathrm{d}w_1}{\mathrm{d} x}+\frac{\partial w_5}{\partial w_2} \frac{\mathrm{d}w_2}{\mathrm{d} x}\right) + \left(\frac{\partial w_6}{\partial w_2} \frac{\mathrm{d}w_2}{\mathrm{d} x}+\frac{\partial w_6}{\partial w_3} \frac{\mathrm{d}w_3}{\mathrm{d} x}\right) \right) dydx=((w4w1+w5w1)dw1dx+(w5w2+w6w2)dw2dx+w6w3dw3dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(\frac{\partial w_4}{\partial w_1}+\frac{\partial w_5}{\partial w_1} \right)\frac{\mathrm{d}w_1}{\mathrm{d} x} + \left(\frac{\partial w_5}{\partial w_2}+\frac{\partial w_6}{\partial w_2} \right)\frac{\mathrm{d}w_2}{\mathrm{d} x} + \frac{\partial w_6}{\partial w_3} \frac{\mathrm{d}w_3}{\mathrm{d} x} \right) dydx=((cosw1+w2)dw1dx+(w1+1)dw2dx+1dw3dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(\cos w_1+w_2 \right)\frac{\mathrm{d}w_1}{\mathrm{d} x} + \left(w_1+1 \right)\frac{\mathrm{d}w_2}{\mathrm{d} x} + 1 \frac{\mathrm{d}w_3}{\mathrm{d} x} \right) dydx=((1+0)dw1dx+(0+1)dw2dx+1dw3dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\left(1+0 \right)\frac{\mathrm{d}w_1}{\mathrm{d} x} + \left(0+1 \right)\frac{\mathrm{d}w_2}{\mathrm{d} x} + 1 \frac{\mathrm{d}w_3}{\mathrm{d} x} \right) dydx=(dw1dx+dw2dx+dw3dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\mathrm{d}w_1}{\mathrm{d} x} + \frac{\mathrm{d}w_2}{\mathrm{d} x} + \frac{\mathrm{d}w_3}{\mathrm{d} x} \right)

We are almost there!

dydx=(w1x1dx1dx+w2x2dx2dx+w3x3dx3dx) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\partial w_1}{\partial x_1}\frac{\mathrm{d}x_1}{\mathrm{d} x} + \frac{\partial w_2}{\partial x_2}\frac{\mathrm{d}x_2}{\mathrm{d} x} + \frac{\partial w_3}{\partial x_3}\frac{\mathrm{d}x_3}{\mathrm{d} x} \right) dydx=(w1x1+w2x2+w3x3) \frac{\mathrm{d}y}{\mathrm{d}\boldsymbol{x}}=\left(\frac{\partial w_1}{\partial x_1} + \frac{\partial w_2}{\partial x_2} + \frac{\partial w_3}{\partial x_3}\right)

We have now written the gradient of our function! We three terms, each of the proportional to a partial derivative. The coefficient multiplying each of these derivatives is the corresponding element of the gradient! Thus, we can conclude that the calculation give the same result as before!

References and Footnotes

[1] This is not the most efficient way to code the finite difference derivative, it is just something quick and dirt to show the method. A more efficient implementation can be found in (FiniteDifference.jl)[https://github.com/JuliaDiff/FiniteDifferences.jl]