Several computational techniques (minimization algorithms, training of Deep Neural Networks, Hamiltonian Monte Carlo) require the computation of gradients, Jacobians, and Hessians. While we all learned how to compute these quantities in our calculus courses, those techniques may not be well suited to numerical computations.
In the remainder of this post, I'll walk through the main techniques that can be used to compute derivatives: the symbolic approach, finite difference derivatives, forward algorithmic differentiation, and backward algorithmic differentiation.
Example
Let us consider the following function:

$$f(x_1, x_2, x_3) = \sin(x_1) + x_1 x_2 + e^{x_2 + x_3}$$

For instance, if $x = (0, 0, 0)$, what is the value of the gradient of $f$ evaluated at $x$?
Symbolic approach
This is the classical approach that we all learnt during calculus courses: you simply have to write down the analytical derivatives and compute their values
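For our example, writing the partial derivatives explicitly and evaluating them at $x = (0, 0, 0)$ gives

$$\nabla f(x) = \begin{pmatrix} \cos(x_1) + x_2 \\ x_1 + e^{x_2 + x_3} \\ e^{x_2 + x_3} \end{pmatrix}, \qquad \nabla f(0, 0, 0) = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$$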
Quite easy, isn't it? What are the advantages of this approach?
✅ High precision, since derivatives are analytical
✅ As long as you have an analytical expression for the input function, this approach works
However, this approach has several drawbacks
❌ For complicated functions, the derivative may not have a manageable analytical expression
❌ We don't always have an analytical function to differentiate. For instance, when dealing with differential equations, we often only have a numerical solution
Some of these problems can be alleviated with the next approach: finite difference derivatives.
Finite difference derivatives
The second approach replaces the limit in the definition of the derivative with a finite difference:
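$$\frac{\mathrm{d}f}{\mathrm{d}x}(x) \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$$

Here I am using the central (symmetric) difference scheme, which is the same one implemented in the code below; forward and backward differences are other common choices.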
If we have a function of more variables, this approach needs to be applied to each of the variables involved. Let us code it in Julia!
f(x) = sin(x[1])+x[1]*x[2]+exp(x[2]+x[3])
What about the efficiency of this function?
using BenchmarkTools
@benchmark f([0,0,0])
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max): 30.438 ns … 15.791 μs ┊ GC (min … max): 0.00% … 99.74%
Time (median): 32.629 ns ┊ GC (median): 0.00%
Time (mean ± σ): 51.697 ns ± 379.253 ns ┊ GC (mean ± σ): 17.95% ± 2.44%
▄█▆▅▄▃▂▂▂▁ ▂▄▄▅▅▄▄▃▁ ▂
████████████▇█▇▇█▇▇▇▇▇▆▅▇▇▆▆▆▅▅▆▅▄▄▄▅▄▄▄▄▄▂▆██████████▇▇▇▇█▇ █
30.4 ns Histogram: log(frequency) by time 69.8 ns <
Memory estimate: 80 bytes, allocs estimate: 1.
Let us evaluate the finite difference derivative![1]
using LinearAlgebra
function ∇f(f, x, ϵ)
    gradf = zeros(size(x))
    # Each row of ϵ_matrix is ϵ times the corresponding unit vector
    ϵ_matrix = ϵ * Matrix(I, length(x), length(x))
    for i in 1:length(x)
        # Central difference along the i-th coordinate
        gradf[i] = (f(x + ϵ_matrix[i, :]) - f(x - ϵ_matrix[i, :])) / 2ϵ
    end
    return gradf
end
gradf = ∇f(f,[0,0,0], 1e-4)
Let us see the result of the calculation!
∂f1=0.99999999833289
∂f2=1.0000000016668897
∂f3=1.0000000016668897
Nice! We have evaluated the required gradient. However, the result is not exact: there is a truncation error! We have approximated the derivative, and this gives an imprecise result. Furthermore, the error depends on the chosen step size. If the step size is too big, we do not approximate the derivative well...
gradf = ∇f(f,[0,0,0], 1e-1)
∂f1=0.9983341664682821
∂f2=1.001667500198441
∂f3=1.001667500198441
...while, on the other hand, a too small step size incurs floating-point precision errors
gradf = ∇f(f,[0,0,0], 1e-15)
∂f1=1.0547118733938987
∂f2=1.0547118733938987
∂f3=1.0547118733938987
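This is the usual trade-off of finite differences: for the central scheme, the truncation error scales roughly like $\epsilon^2$, while the rounding error scales like $u/\epsilon$, where $u \approx 2.2 \times 10^{-16}$ is the double-precision machine epsilon. Roughly,

$$\mathrm{error} \sim \frac{|f'''(x)|}{6}\,\epsilon^2 + \frac{u\,|f(x)|}{\epsilon}$$

so, as a rule of thumb, the sweet spot lies around $\epsilon \sim u^{1/3} \approx 6 \times 10^{-6}$, consistent with the good result obtained above with $\epsilon = 10^{-4}$.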
On the performance side,
@benchmark gradf = ∇f(f,[0,0,0], 1e-4)
BenchmarkTools.Trial: 10000 samples with 191 evaluations.
Range (min … max): 522.330 ns … 83.213 μs ┊ GC (min … max): 0.00% … 99.18%
Time (median): 554.097 ns ┊ GC (median): 0.00%
Time (mean ± σ): 825.066 ns ± 3.434 μs ┊ GC (mean ± σ): 18.08% ± 4.31%
▅█▇▆▅▄▄▃▃▂▂▂▂▁▁▁▁ ▁ ▁ ▁▄▅▅▄▃▂▁ ▂
██████████████████████▇▇▇▇▆▆▆▅▄▅▅▅▄▅▅▃▁▁▃▄▃▁▃▃▁▁▄▃▅█████████ █
522 ns Histogram: log(frequency) by time 1.11 μs <
Memory estimate: 1.28 KiB, allocs estimate: 16.
Forward algorithmic differentiation
A third option is forward algorithmic (automatic) differentiation: instead of approximating the derivative, it propagates derivative information alongside the function evaluation, giving results that are exact up to floating-point round-off. In Julia, this is provided for example by ForwardDiff.jl:
using ForwardDiff
@benchmark ForwardDiff.gradient(f, [0,0,0])
[1.0, 1.0, 1.0]
BenchmarkTools.Trial: 10000 samples with 214 evaluations.
Range (min … max): 347.706 ns … 74.440 μs ┊ GC (min … max): 0.00% … 99.25%
Time (median): 368.958 ns ┊ GC (median): 0.00%
Time (mean ± σ): 473.245 ns ± 1.934 μs ┊ GC (mean ± σ): 10.80% ± 2.63%
▂▇█▇▅▄▄▂▂▂▂▁▁▁▁ ▁ ▁ ▁▁ ▁▃▄▅▅▅▅▄▄▃▂▂▁▁ ▂
████████████████████████████▇▆▇▇▆▆▅▅▅▆▆▅▆█████████████████▇▇ █
348 ns Histogram: log(frequency) by time 565 ns <
Memory estimate: 512 bytes, allocs estimate: 5.
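To give an intuition of what ForwardDiff.jl is doing under the hood, here is a minimal, deliberately naive sketch of forward-mode differentiation with dual numbers; the `Dual` type and the handful of overloads below are my own toy illustration, not the actual ForwardDiff.jl implementation.

# A number that carries both its value and its derivative with respect to one chosen input
struct Dual
    val::Float64
    der::Float64
end

# Propagate derivatives through the operations we need, using the usual rules
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)
Base.exp(a::Dual) = Dual(exp(a.val), exp(a.val) * a.der)

# ∂f/∂x₂ at (0,0,0): seed x₂ with derivative 1, the other inputs with 0
x1, x2, x3 = Dual(0.0, 0.0), Dual(0.0, 1.0), Dual(0.0, 0.0)
y = sin(x1) + x1 * x2 + exp(x2 + x3)
y.der   # 1.0

Repeating the computation with the seed on $x_1$ or $x_3$ gives the other components; ForwardDiff.jl does essentially this, but propagates all the partial derivatives at once.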
Backward algorithmic differentiation
In this part, I'll show a different approach to computing the gradient: backward automatic differentiation. I'll write down all the steps and show them graphically. Disclaimer: I am going to be a bit tedious.
Forward pass
The first step requires the computation of the function. While computing the function, we define some intermediate variables and store their values, writing down the Wengert list. Let us start from the first step.
Variable | Value |
---|---|
$w_1 = x_1$ | $0$ |
$w_2 = x_2$ | $0$ |
$w_3 = x_3$ | $0$ |
This was an easy one: basically, we simply had to define three variables, corresponding to the three inputs. Let's move forward!
Variable | Value |
---|---|
$w_4 = \sin(w_1)$ | $0$ |
$w_5 = e^{w_2 + w_3}$ | $1$ |
As we can see, we have introduced some real functions ($\sin$, $\exp$) and combined the variables defined in the previous step.
Variable | Value |
---|---|
$w_6 = w_1 w_2$ | $0$ |
Let's just keep going on...
Variable | Value |
---|---|
$w_7 = w_4 + w_6$ | $0$ |
...and on, until the last operation produces the output...

Variable | Value |
---|---|
$w_8 = w_7 + w_5$ | $1$ |
...and here we are! We have computed the Wengert list and the output of the function, $f(0, 0, 0) = w_8 = 1$. While this looks trivial and useless, we have evaluated all the quantities required for the next step!
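To make this concrete, here is a minimal Julia sketch of the forward pass above; the names `w1`, ..., `w8` are just my own labels and mirror the tables.

# Inputs
x1, x2, x3 = 0.0, 0.0, 0.0

# Forward pass: compute the function, storing every intermediate value (the Wengert list)
w1, w2, w3 = x1, x2, x3
w4 = sin(w1)        # 0.0
w5 = exp(w2 + w3)   # 1.0
w6 = w1 * w2        # 0.0
w7 = w4 + w6        # 0.0
w8 = w7 + w5        # 1.0, i.e. f(0, 0, 0)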
Backward pass
We are now in the position to compute the gradient of the function. Let us start by defining the adjoint of a quantity $w$, which is mapped into another quantity $y$:

$$\bar{w} = \frac{\partial y}{\partial w}$$
This quantity and the chain rule will be the key ingredients in the forthcoming calculations. Using the chain rule, we will start from the output, $w_8$, and we will pull the calculation back until we reach the beginning, i.e. the inputs. How can we use the chain rule? The gradient can be rewritten as

$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial w_8}\,\frac{\partial w_8}{\partial x_i} = \bar{w}_8\,\frac{\partial w_8}{\partial x_i}$$

Since we know the relation between $f$ and $w_8$ (the output of the Wengert list is exactly the function, $f = w_8$), we can compute the partial derivative we just added: $\bar{w}_8 = \partial f / \partial w_8 = 1$.
But now, let's get back to the graph! I'll add in green each calculation we have been doing.
Let's keep going!
Up to now, everything has been quite easy. Let's continue with the next step!
Finally! As we can see, evaluating the partial derivative requires the value of $w_5$ (since $\partial w_5/\partial w_2 = e^{w_2 + w_3} = w_5$). But we already know this value, since it is stored in the Wengert list!
Now, the meaning of the first step should be clearer: we have evaluated and stored all intermediate quantities. In this way, while moving back along the computational graph, we already have all the quantities required to compute the gradient!
Let's keep going!
We are almost there!
We have now written the gradient of our function! We have three terms, each of them proportional to the partial derivative with respect to one of the inputs. The coefficient multiplying each of these derivatives is the corresponding element of the gradient! Thus, we can conclude that the calculation gives the same result as before, $\nabla f(0, 0, 0) = (1, 1, 1)$.
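To close the loop, here is a hand-written backward pass in Julia for this specific function, reusing the values stored by the forward-pass sketch above; `w1_bar`, ..., `w8_bar` are my own notation for the adjoints $\bar{w}_i = \partial w_8 / \partial w_i$.

# Backward pass: propagate the adjoints from the output back to the inputs
w8_bar = 1.0                              # ∂w8/∂w8 = 1
w7_bar = w8_bar                           # w8 = w7 + w5
w5_bar = w8_bar
w4_bar = w7_bar                           # w7 = w4 + w6
w6_bar = w7_bar
w1_bar = w4_bar * cos(w1) + w6_bar * w2   # w4 = sin(w1), w6 = w1 * w2
w2_bar = w6_bar * w1 + w5_bar * w5        # ∂w5/∂w2 = exp(w2 + w3) = w5
w3_bar = w5_bar * w5                      # ∂w5/∂w3 = exp(w2 + w3) = w5

[w1_bar, w2_bar, w3_bar]                  # [1.0, 1.0, 1.0]

In real code one would of course not write this by hand: packages such as Zygote.jl or ReverseDiff.jl build and traverse the computational graph automatically.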
References and Footnotes
[1] | This is not the most efficient way to code the finite difference derivative, it is just something quick and dirty to show the method. A more efficient implementation can be found in [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl) |