How to count and cache function calls
Ronny Bergmann
In this tutorial, we want to investigate the caching and counting (statistics) features of Manopt.jl. We will reuse the optimization tasks from the introductory tutorial Get started: optimize!.
Introduction
There are surely many ways to keep track for example of how often the cost function is called, for example with a functor, as we used in an example in How to Record Data
mutable struct MyCost{I<:Integer}
count::I
end
MyCost() = MyCost{Int64}(0)
function (c::MyCost)(M, x)
c.count += 1
# [ .. Actual implementation of the cost here ]
end
This still leaves a bit of work to the user, especially for tracking more than just the number of cost function evaluations.
When a function like the objective or gradient is expensive to compute, it may make sense to cache its results. Manopt.jl tries to minimize the number of repeated calls but sometimes they are necessary and harmless when the function is cheap to compute. Caching of expensive function calls can for example be added using Memoize.jl by the user. The approach in the solvers of Manopt.jl aims to simplify adding both these capabilities on the level of calling a solver.
Technical Background
The two ingredients for a solver in Manopt.jl are the AbstractManoptProblem
and the AbstractManoptSolverState
, where the former consists of the domain, that is the AsbtractManifold
and AbstractManifoldObjective
.
Both recording and debug capabilities are implemented in a decorator pattern to the solver state. They can be easily added using the record=
and debug=
in any solver call. This pattern was recently extended, such that also the objective can be decorated. This is how both caching and counting are implemented, as decorators of the AbstractManifoldObjective
and hence for example changing/extending the behaviour of a call to get_cost
.
Let’s finish off the technical background by loading the necessary packages. Besides Manopt.jl and Manifolds.jl we also need LRUCaches.jl which are (since Julia 1.9) a weak dependency and provide the least recently used strategy for our caches.
using Manopt, Manifolds, Random, LRUCache, LinearAlgebra, ManifoldDiff
using ManifoldDiff: grad_distance
Counting
We first define our task, the Riemannian Center of Mass from the Get started: optimize! tutorial.
n = 100
σ = π / 8
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
Random.seed!(42)
data = [exp(M, p, σ * rand(M; vector_at=p)) for i in 1:n];
f(M, p) = sum(1 / (2 * n) * distance.(Ref(M), Ref(p), data) .^ 2)
grad_f(M, p) = sum(1 / n * grad_distance.(Ref(M), data, Ref(p)));
to now count how often the cost and the gradient are called, we use the count=
keyword argument that works in any solver to specify the elements of the objective whose calls we want to count calls to. A full list is available in the documentation of the AbstractManifoldObjective
. To also see the result, we have to set return_objective=true
. This returns (objective, p)
instead of just the solver result p
. We can further also set return_state=true
to get even more information about the solver run.
gradient_descent(M, f, grad_f, data[1]; count=[:Cost, :Gradient], return_objective=true, return_state=true)
# Solver state for `Manopt.jl`s Gradient Descent
After 66 iterations
## Parameters
* retraction method: ExponentialRetraction()
## Stepsize
ArmijoLinesearch() with keyword parameters
* initial_stepsize = 1.0
* retraction_method = ExponentialRetraction()
* contraction_factor = 0.95
* sufficient_decrease = 0.1
## Stopping criterion
Stop When _one_ of the following are fulfilled:
Max Iteration 200: not reached
|grad f| < 1.0e-8: reached
Overall: reached
This indicates convergence: Yes
## Statistics on function calls
* :Gradient : 199
* :Cost : 275
And we see that statistics are shown in the end.
Caching
To now also cache these calls, we can use the cache=
keyword argument. Since now both the cache and the count “extend” the functionality of the objective, the order is important: on the high-level interface, the count
is treated first, which means that only actual function calls and not cache look-ups are counted. With the proper initialisation, you can use any caches here that support the get!(function, cache, key)!
update. All parts of the objective that can currently be cached are listed at ManifoldCachedObjective
. The solver call has a keyword cache
that takes a tuple(c, vs, n)
of three arguments, where c
is a symbol for the type of cache, vs
is a vector of symbols, which calls to cache and n
is the size of the cache. If the last element is not provided, a suitable default (currentlyn=10
) is used.
Here we want to use c=:LRU
caches for vs=[Cost, :Gradient]
with a size of n=25
.
r = gradient_descent(M, f, grad_f, data[1];
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true, return_state=true)
# Solver state for `Manopt.jl`s Gradient Descent
After 66 iterations
## Parameters
* retraction method: ExponentialRetraction()
## Stepsize
ArmijoLinesearch() with keyword parameters
* initial_stepsize = 1.0
* retraction_method = ExponentialRetraction()
* contraction_factor = 0.95
* sufficient_decrease = 0.1
## Stopping criterion
Stop When _one_ of the following are fulfilled:
Max Iteration 200: not reached
|grad f| < 1.0e-8: reached
Overall: reached
This indicates convergence: Yes
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 66
* :Cost : 149
Since the default setup with ArmijoLinesearch
needs the gradient and the cost, and similarly the stopping criterion might (independently) evaluate the gradient, the caching is quite helpful here.
And of course also for this advanced return value of the solver, we can still access the result as usual:
get_solver_result(r)
3-element Vector{Float64}:
0.6868392807355564
0.006531599748261925
0.7267799809043942
Advanced Caching Examples
There are more options other than caching single calls to specific parts of the objective. For example you may want to cache intermediate results of computing the cost and share that with the gradient computation. We will present three solutions to this:
- An easy approach from within
Manopt.jl
: theManifoldCostGradientObjective
- A shared storage approach using a functor
- A shared (internal) cache approach also using a functor
For that we switch to another example: The Rayleigh quotient. We aim to maximize the Rayleigh quotient $\displaystyle\frac{x^{\mathrm{T}}Ax}{x^{\mathrm{T}}x}$, for some $A∈ℝ^{m+1\times m+1}$ and $x∈ℝ^{m+1}$ but since we consider this on the sphere and Manopt.jl
(as many other optimization toolboxes) minimizes, we consider
\[g(p) = -p^{\mathrm{T}}Ap,\qquad p∈\mathbb S^{m}\]
The Euclidean gradient (that is in $ R^{m+1}$) is actually just $\nabla g(p) = -2Ap$, the Riemannian gradient the projection of $\nabla g(p)$ onto the tangent space $T_p\mathbb S^{m}$.
m = 25
Random.seed!(42)
A = randn(m + 1, m + 1)
A = Symmetric(A)
p_star = eigvecs(A)[:, end] # minimizer (or similarly -p)
f_star = -eigvals(A)[end] # cost (note that we get - the largest Eigenvalue)
N = Sphere(m);
g(M, p) = -p' * A*p
∇g(p) = -2 * A * p
grad_g(M,p) = project(M, p, ∇g(p))
grad_g!(M,X, p) = project!(M, X, p, ∇g(p))
grad_g! (generic function with 1 method)
But since both the cost and the gradient require the computation of the matrix-vector product $Ap$, it might be beneficial to only compute this once.
The ManifoldCostGradientObjective
approach
The ManifoldCostGradientObjective
uses a combined function to compute both the gradient and the cost at the same time. We define the in-place variant as
function g_grad_g!(M::AbstractManifold, X, p)
X .= -A*p
c = p'*X
X .*= 2
project!(M, X, p, X)
return (c, X)
end
g_grad_g! (generic function with 1 method)
where we only compute the matrix-vector product once. The small disadvantage might be, that we always compute both, the gradient and the cost. Luckily, the cache we used before, takes this into account and caches both results, such that we indeed end up computing A*p
only once when asking to a cost and a gradient.
Let’s compare both methods
p0 = [(1/5 .* ones(5))..., zeros(m-4)...];
@time s1 = gradient_descent(N, g, grad_g!, p0;
stopping_criterion = StopWhenGradientNormLess(1e-5),
evaluation=InplaceEvaluation(),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true,
)
0.901452 seconds (1.05 M allocations: 71.294 MiB, 0.92% gc time, 99.46% compilation time)
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 602
* :Cost : 1449
To access the solver result, call `get_solver_result` on this variable.
versus
obj = ManifoldCostGradientObjective(g_grad_g!; evaluation=InplaceEvaluation())
@time s2 = gradient_descent(N, obj, p0;
stopping_criterion=StopWhenGradientNormLess(1e-5),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true,
)
0.739526 seconds (710.86 k allocations: 56.006 MiB, 2.05% gc time, 97.89% compilation time)
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 1448
* :Cost : 1448
To access the solver result, call `get_solver_result` on this variable.
first of all both yield the same result
p1 = get_solver_result(s1)
p2 = get_solver_result(s2)
[distance(N, p1, p2), g(N, p1), g(N, p2), f_star]
4-element Vector{Float64}:
0.0
-7.8032957637779
-7.8032957637779
-7.803295763793949
and we can see that the combined number of evaluations is once 2051, once just the number of cost evaluations 1449. Note that the involved additional 847 gradient evaluations are merely a multiplication with 2. On the other hand, the additional caching of the gradient in these cases might be less beneficial. It is beneficial, when the gradient and the cost are very often required together.
A shared storage approach using a functor
An alternative to the previous approach is the usage of a functor that introduces a “shared storage” of the result of computing A*p
. We additionally have to store p
though, since we have to check that we are still evaluating the cost and/or gradient at the same point at which the cached A*p
was computed. We again consider the (more efficient) in-place variant. This can be done as follows
struct StorageG{T,M}
A::M
Ap::T
p::T
end
function (g::StorageG)(::Val{:Cost}, M::AbstractManifold, p)
if !(p==g.p) #We are at a new point -> Update
g.Ap .= g.A*p
g.p .= p
end
return -g.p'*g.Ap
end
function (g::StorageG)(::Val{:Gradient}, M::AbstractManifold, X, p)
if !(p==g.p) #We are at a new point -> Update
g.Ap .= g.A*p
g.p .= p
end
X .= -2 .* g.Ap
project!(M, X, p, X)
return X
end
Here we use the first parameter to distinguish both functions. For the mutating case the signatures are different regardless of the additional argument but for the allocating case, the signatures of the cost and the gradient function are the same.
#Define the new functor
storage_g = StorageG(A, zero(p0), zero(p0))
# and cost and gradient that use this functor as
g3(M,p) = storage_g(Val(:Cost), M, p)
grad_g3!(M, X, p) = storage_g(Val(:Gradient), M, X, p)
@time s3 = gradient_descent(N, g3, grad_g3!, p0;
stopping_criterion = StopWhenGradientNormLess(1e-5),
evaluation=InplaceEvaluation(),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 2),
return_objective=true#, return_state=true
)
0.426491 seconds (301.58 k allocations: 21.769 MiB, 98.94% compilation time)
## Cache
* :Cost : 2/2 entries of type Float64 used
* :Gradient : 2/2 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 602
* :Cost : 1449
To access the solver result, call `get_solver_result` on this variable.
This of course still yields the same result
p3 = get_solver_result(s3)
g(N, p3) - f_star
1.6049384043981263e-11
And while we again have a split off the cost and gradient evaluations, we can observe that the allocations are less than half of the previous approach.
A local cache approach
This variant is very similar to the previous one, but uses a whole cache instead of just one place to store A*p
. This makes the code a bit nicer, and it is possible to store more than just the last p
either cost or gradient was called with.
struct CacheG{C,M}
A::M
cache::C
end
function (g::CacheG)(::Val{:Cost}, M, p)
Ap = get!(g.cache, copy(M,p)) do
g.A*p
end
return -p'*Ap
end
function (g::CacheG)(::Val{:Gradient}, M, X, p)
Ap = get!(g.cache, copy(M,p)) do
g.A*p
end
X .= -2 .* Ap
project!(M, X, p, X)
return X
end
However, the resulting solver run is not always faster, since the whole cache instead of storing just Ap
and p
is a bit more costly. Then the tradeoff is, whether this pays off.
#Define the new functor
cache_g = CacheG(A, LRU{typeof(p0),typeof(p0)}(; maxsize=25))
# and cost and gradient that use this functor as
g4(M,p) = cache_g(Val(:Cost), M, p)
grad_g4!(M, X, p) = cache_g(Val(:Gradient), M, X, p)
@time s4 = gradient_descent(N, g4, grad_g4!, p0;
stopping_criterion = StopWhenGradientNormLess(1e-5),
evaluation=InplaceEvaluation(),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true,
)
0.343276 seconds (302.50 k allocations: 22.214 MiB, 98.27% compilation time)
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 602
* :Cost : 1449
To access the solver result, call `get_solver_result` on this variable.
and for safety let’s check that we are reasonably close
p4 = get_solver_result(s4)
g(N, p4) - f_star
1.6049384043981263e-11
For this example, or maybe even gradient_descent
in general it seems, this additional (second, inner) cache does not improve the result further, it is about the same effort both time and allocation-wise.
Summary
While the second approach of ManifoldCostGradientObjective
is very easy to implement, both the storage and the (local) cache approach are more efficient. All three are an improvement over the first implementation without sharing interim results. The results with storage or cache have further advantage of being more flexible, since the stored information could also be reused in a third function, for example when also computing the Hessian.