Several computational techniques (minimization algorithms, training of Deep Neural Networks, Hamiltonian Monte Carlo) require the computation of gradients, Jacobians, and Hessians. While we all learnt how to compute these quantities during calculus courses, the pen-and-paper approach may not be well suited to numerical computations.
In the remainder of this post, I'll walk through the main techniques that can be used to compute derivatives: the symbolic approach, finite differences, forward algorithmic differentiation, and backward algorithmic differentiation.
Example
Let us consider the following function

$$f(x_1, x_2, x_3) = \sin(x_1) + x_1 x_2 + e^{x_2 + x_3}$$

For instance, if $(x_1, x_2, x_3) = (0, 0, 0)$, what is the value of the gradient of $f$ evaluated at that point?
Symbolic approach
This is the classical approach that we all learnt during calculus courses: you simply have to write down the analytical derivatives and compute their values

$$\frac{\partial f}{\partial x_1} = \cos(x_1) + x_2, \qquad \frac{\partial f}{\partial x_2} = x_1 + e^{x_2 + x_3}, \qquad \frac{\partial f}{\partial x_3} = e^{x_2 + x_3}$$

which, evaluated at $(0, 0, 0)$, give $\nabla f(0, 0, 0) = (1, 1, 1)$.
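If you prefer to let the computer carry out the symbolic manipulation, a computer algebra system can produce the same expressions. As a minimal sketch (not part of the original example), assuming the Symbolics.jl package:

using Symbolics

# Define symbolic variables and the symbolic expression of f
@variables x1 x2 x3
expr = sin(x1) + x1*x2 + exp(x2 + x3)

# Symbolic gradient, then evaluate it at (0, 0, 0)
grad = Symbolics.gradient(expr, [x1, x2, x3])
substitute.(grad, Ref(Dict(x1 => 0, x2 => 0, x3 => 0)))   # ≈ [1, 1, 1]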
Quite easy, isn't it? What are the advantages of this approach?
✅ High precision, since derivatives are analytical
✅ As long as you have an analytical expression for the input function, this approach works
However, this approach has several drawbacks
❌ For complicated functions, the derivative may not have a manageable analytical expression
❌ We don't always have an analytical function to differentiate. For instance, when dealing with Differential Equations, we often only have a numerical solution
Some of these problems can be alleviated with the next approach: finite difference derivatives.
Finite difference derivatives
The second approach replaces the limit in the definition of the derivative with a finite difference:

$$f'(x) \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$$

where $\epsilon$ is a small step-size.
If we have a function of several variables, this approach needs to be extended to each of the variables involved, perturbing one coordinate at a time. Let us code it in Julia!
f(x) = sin(x[1])+x[1]*x[2]+exp(x[2]+x[3])
What about the efficiency of this function?
using BenchmarkTools
@benchmark f([0,0,0])
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max): 29.312 ns … 2.354 μs ┊ GC (min … max): 0.00% … 98.55%
Time (median): 30.913 ns ┊ GC (median): 0.00%
Time (mean ± σ): 36.569 ns ± 89.777 ns ┊ GC (mean ± σ): 10.07% ± 4.05%
▂▅█▇▃▂▃▃▂▁▁ ▂▁ ▁▁ ▂ ▂
█████████████▇▇▇▇█████▇▆▆▆▄▆▅▅▅▁▁▄▃▅▅▁▁▄▃▃▄▁▄▃▃▄▁▄▁▄▆█████▇ █
29.3 ns Histogram: log(frequency) by time 62.7 ns <
Memory estimate: 80 bytes, allocs estimate: 1.
Let us evaluate the finite difference derivative![1]
using LinearAlgebra
function ∇f(f, x, ϵ)
    gradf = zeros(size(x))
    # The i-th row of ϵ_matrix is ϵ times the unit vector along the i-th coordinate
    ϵ_matrix = ϵ * Matrix(I, length(x), length(x))
    for i in 1:length(x)
        # Central finite difference along the i-th coordinate
        gradf[i] = (f(x + ϵ_matrix[i, :]) - f(x - ϵ_matrix[i, :])) / 2ϵ
    end
    return gradf
end
gradf = ∇f(f,[0,0,0], 1e-4)
Let us see the result of the calculation!
∂f1=0.99999999833289
∂f2=1.0000000016668897
∂f3=1.0000000016668897
Nice! We have evaluated the required gradient. However, the result is not exact: there is a truncation error! We approximated the derivative, and this gave us an approximate result. Furthermore, the error depends on the chosen step-size. If the step-size is too big, the finite difference is a poor approximation of the derivative...
gradf = ∇f(f,[0,0,0], 1e-1)
∂f1=0.9983341664682821
∂f2=1.001667500198441
∂f3=1.001667500198441
...on the other hand, a too small step-size will incur floating-point round-off errors
gradf = ∇f(f,[0,0,0], 1e-15)
∂f1=1.0547118733938987
∂f2=1.0547118733938987
∂f3=1.0547118733938987
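To see the trade-off between truncation error (large step-size) and round-off error (tiny step-size) more explicitly, we can sweep the step-size over several orders of magnitude and look at the error on the first component, whose exact value is 1. This is just a quick illustrative sketch reusing the ∇f defined above:

# Error on ∂f/∂x₁ at (0, 0, 0) for different step-sizes
for ϵ in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)
    err = abs(∇f(f, [0, 0, 0], ϵ)[1] - 1)
    println("ϵ = ", ϵ, "   error = ", err)
end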
On the performance side,
@benchmark gradf = ∇f(f,[0,0,0], 1e-4)
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
Range (min … max): 518.938 ns … 16.102 μs ┊ GC (min … max): 0.00% … 95.75%
Time (median): 541.745 ns ┊ GC (median): 0.00%
Time (mean ± σ): 641.938 ns ± 1.074 μs ┊ GC (mean ± σ): 12.34% ± 7.08%
▅█
▄██▇▄▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
519 ns Histogram: frequency by time 1.05 μs <
Memory estimate: 1.28 KiB, allocs estimate: 16.
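For real applications you would not hand-roll the loop above: as mentioned in footnote [1], packages such as FiniteDifferences.jl provide tuned finite-difference methods. A rough sketch of how that might look (treat the exact function names as an assumption, not something used in this post):

using FiniteDifferences

# 5th-order central finite difference method for the first derivative
fdm = central_fdm(5, 1)
grad(fdm, f, [0.0, 0.0, 0.0])   # returns a tuple with the gradient w.r.t. each argument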
Forward algorithmic differentiation
The third approach relies on the ForwardDiff.jl package, which implements forward-mode algorithmic differentiation:
using ForwardDiff
ForwardDiff.gradient(f, [0,0,0])
[1.0, 1.0, 1.0]
@benchmark ForwardDiff.gradient(f, [0,0,0])
BenchmarkTools.Trial: 10000 samples with 207 evaluations.
Range (min … max): 364.353 ns … 17.936 μs ┊ GC (min … max): 0.00% … 97.60%
Time (median): 380.082 ns ┊ GC (median): 0.00%
Time (mean ± σ): 441.593 ns ± 727.783 ns ┊ GC (mean ± σ): 7.46% ± 4.44%
▅█▆▄▃▂▃▃▂▁▁ ▁▁▁ ▁
█████████████▇▇▆▇▇▇███▇▆▅▆▆▅▅▄▄▄▅▆▅▆▅▆▇▇▇▇▆▆▄▄▂▄▃▃▂▃▃▄▃▃▅▅▅▅▅ █
364 ns Histogram: log(frequency) by time 862 ns <
Memory estimate: 512 bytes, allocs estimate: 5.
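Where do the exact result and the good performance come from? Forward-mode algorithmic differentiation propagates, together with the value of each intermediate quantity, its derivative with respect to the inputs, usually encoded as dual numbers. The following is a deliberately minimal toy sketch of this idea (the Dual type and toy_gradient function are illustrative names of mine, not how ForwardDiff.jl is actually implemented):

# A toy dual number: a value plus a derivative along one seed direction
struct Dual
    val::Float64
    der::Float64
end

# Each operation propagates the derivative alongside the value
import Base: +, *, sin, exp
+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)
exp(a::Dual) = Dual(exp(a.val), exp(a.val) * a.der)

# One pass per input: seed the i-th derivative with 1 and read off ∂f/∂xᵢ
function toy_gradient(f, x)
    n = length(x)
    [f([Dual(x[j], i == j ? 1.0 : 0.0) for j in 1:n]).der for i in 1:n]
end

toy_gradient(f, [0.0, 0.0, 0.0])   # [1.0, 1.0, 1.0]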
Backward algorithmic differentiation
In this part, I'll show a different approach to computing the gradient: backward automatic differentiation. I'll write down all the steps and show them graphically. Disclaimer: I am going to be a bit tedious.
Forward pass
The first step is the computation of the function itself. While evaluating the function, we define some intermediate variables and store their values, writing down the so-called Wengert list. Let us start with the first entries.
Variable | Value |
---|---|
$w_1 = x_1$ | $0$ |
$w_2 = x_2$ | $0$ |
$w_3 = x_3$ | $0$ |
This was an easy one: basically, we simply had to define three variables, corresponding to the three inputs. Let's move forward!
Variable | Value |
---|---|
$w_4 = \sin(w_1)$ | $0$ |
$w_5 = w_1 \cdot w_2$ | $0$ |
$w_6 = w_2 + w_3$ | $0$ |
As we can see, we have applied some elementary functions and operations ($\sin$, the product, the sum) to combine the variables defined in the previous step.
Variable | Value |
---|---|
$w_7 = e^{w_6}$ | $1$ |
Let's just keep going on...
Variable | Value |
---|---|
$w_8 = w_4 + w_5$ | $0$ |
...and on, adding the last entry, $w_9 = w_8 + w_7 = 1$, which is the value of the function itself...
...and here we are! We have computed the Wengert list and the output of the function. While this looks trivial and useless, we have evaluated all the quantities required for the next step!
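To make the forward pass fully concrete, here is the same Wengert list written as plain Julia assignments; the wᵢ names mirror the tables above and are purely illustrative:

# Inputs
x1, x2, x3 = 0.0, 0.0, 0.0

# Forward pass: evaluate and store every intermediate quantity
w1, w2, w3 = x1, x2, x3
w4 = sin(w1)      # 0.0
w5 = w1 * w2      # 0.0
w6 = w2 + w3      # 0.0
w7 = exp(w6)      # 1.0
w8 = w4 + w5      # 0.0
w9 = w8 + w7      # 1.0, the value of f(0, 0, 0)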
Backward pass
We are now in a position to compute the gradient of the function. Let us start by defining the adjoint of a quantity $w$, which is mapped into another quantity $y$ (here, the output $w_9$):

$$\bar{w} = \frac{\partial y}{\partial w}$$
This quantity and the chain rule will be the key ingredients in the forthcoming calculations. Using the chain rule, we will start from the output, $w_9$, and pull the calculation back until we reach the beginning, i.e. the inputs. How can we use the chain rule? The gradient can be rewritten as

$$\frac{\partial w_9}{\partial x_i} = \frac{\partial w_9}{\partial w_8}\,\frac{\partial w_8}{\partial x_i} + \frac{\partial w_9}{\partial w_7}\,\frac{\partial w_7}{\partial x_i}$$

Since we know the relation between $w_9$, $w_8$ and $w_7$ (namely $w_9 = w_8 + w_7$), we can compute the partial derivatives we have just added: $\bar{w}_8 = \partial w_9 / \partial w_8 = 1$ and $\bar{w}_7 = \partial w_9 / \partial w_7 = 1$.
But now, let's get back to the graph! I'll add in green each calculation we have been doing
Let's keep going!
Up to now, everything has been quite easy. Let's continue with the next step!
Finally! As we can see, evaluating this partial derivative, $\partial w_7 / \partial w_6 = e^{w_6}$, requires the value of $w_6$. But we already know this value, since it is stored in the Wengert list!
Now, the meaning of the first step should be clearer: we have evaluated and stored all the intermediate quantities. In this way, while moving back along the computational graph, we already have every quantity required to compute the gradient!
Let's keep going!
We are almost there!
We have now written the gradient of our function! We are left with three terms, each proportional to the partial derivative with respect to one of the inputs. The coefficient multiplying each of these derivatives is the corresponding component of the gradient! Thus, we can conclude that the calculation gives the same result as before: $\nabla f(0, 0, 0) = (1, 1, 1)$.
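To wrap everything up, here is the backward pass written as plain Julia assignments, reusing the w1, ..., w9 values stored by the forward-pass sketch above (the wbar names stand for the adjoints $\bar{w}_i = \partial w_9 / \partial w_i$ and are, again, purely illustrative):

# Backward pass: propagate the adjoints from the output w9 back to the inputs
wbar9 = 1.0                           # ∂w9/∂w9
wbar8 = wbar9                         # w9 = w8 + w7
wbar7 = wbar9
wbar6 = wbar7 * w7                    # w7 = exp(w6), and exp(w6) is already stored as w7
wbar5 = wbar8                         # w8 = w4 + w5
wbar4 = wbar8
wbar3 = wbar6                         # w6 = w2 + w3
wbar2 = wbar6 + wbar5 * w1            # w2 enters both w6 = w2 + w3 and w5 = w1 * w2
wbar1 = wbar5 * w2 + wbar4 * cos(w1)  # w1 enters both w5 = w1 * w2 and w4 = sin(w1)

(wbar1, wbar2, wbar3)                 # ∇f(0, 0, 0) = (1.0, 1.0, 1.0)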
References and Footnotes
[1] This is not the most efficient way to code a finite difference derivative; it is just something quick and dirty to show the method. A more efficient implementation can be found in [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl).