The Rayleigh Quotient

Ronny Bergmann 2024-03-09

Introduction

This example reproduces a few conceptual ideas of Optimization on Manifolds that are used throughout [Bou23] using the Rayleigh quotient and explores several different ways to use the algorithms from Manopt.jl.

For a symmetric matrix $A \in \mathbb R^{n\times n}$ we consider the 📖 Rayleigh Quotient

\[\operatorname*{arg\,min}_{x \in \mathbb R^n \backslash \{0\}} \frac{x^{\mathrm{T}}Ax}{\lVert x \rVert^2}.\]

On the sphere we can omit the denominator and obtain

\[f(p) = p^{\mathrm{T}}Ap,\qquad p ∈ 𝕊^{n-1},\]

which by itself we can again continue in the embedding as

\[\tilde f(x) = x^{\mathrm{T}}Ax,\qquad x \in \mathbb R^n.\]

This cost has the nice feature that at the minimizer $p^*\in\mathbb S^{n-1}$ the function falue $f(p^*)$ is the smalles eigenvalue of $A$.

For the embedded function $\tilde f$ the gradient and Hessian can be computed with classical methods as

\[\begin{align*} ∇\tilde f(x) &= 2Ax, \qquad x ∈ ℝ^n, \\ ∇^2\tilde f(x)[V] &= 2AV, \qquad x, V ∈ ℝ^n. \end{align*}\]

Similarly, cf. Examples 3.62 and 5.27 of [Bou23], the Riemannian gradient and Hessian on the manifold $\mathcal M = \mathbb S^{n-1}$ are given by

\[\begin{align*} \operatorname{grad} f(p) &= 2Ap - 2(p^{\mathrm{T}}Ap)*p,\qquad p ∈ 𝕊^{n-1}, \\ \operatorname{Hess} f(p)[X] &= 2AX - 2(p^{\mathrm{T}}AX)p - 2(p^{\mathrm{T}}Ap)X,\qquad p ∈ 𝕊^{n-1}, X \in T_p𝕊^{n-1} \end{align*}\]

Let’s first generate an example martrx $A$.

using Pkg;
cd(@__DIR__)
Pkg.activate("."); # use the example environment,
using LRUCache, BenchmarkTools, LinearAlgebra, Manifolds, ManoptExamples, Manopt, Random
Random.seed!(42)
n = 500
A = Symmetric(randn(n, n) / n)

And the manifolds

M = Sphere(n-1)
Sphere(499, ℝ)
E = get_embedding(M)
Euclidean(500; field=ℝ)

Setup the corresponding functions

Since RayleighQuotientCost, RayleighQuotientGrad!!, and RayleighQuotientHess!! are themselves manifold agnostic we only need to initialize them once. Agnostic here means that they would compute $f$ is called with M as their first argument and $\tilde f$ if called with E.

We instantiate

f = ManoptExamples.RayleighQuotientCost(A)
grad_f = ManoptExamples.RayleighQuotientGrad!!(A)
Hess_f = ManoptExamples.RayleighQuotientHess!!(A)

the suffix !! also indicates that these functions both work as allocating and in-place variants. Given a starting point and some memory

p0 = [1.0, zeros(n-1)...]
X = zero_vector(M, p0)

we can both call

Y = grad_f(M, p0)  # Allocates memory
grad_f(M, X, p0)    # Computes in place of X and returns the result in X.
norm(M, p0, X-Y)
0.0

Now we can use a few different variants of solvers to approaach this and this tutorial will walk you through a few of them.

First of all let’s construct the actual result – since Rayleigh quotient minimization is not necessarily the best way to compute the smallest Eigenvalue. We can also compute

λ = min(eigvals(A)...)
-0.08967721009388108

A Solver based on gradient information

Let’s first just use first-order information and since we are just starting, maybe we only derived the Euclidean gradient $\nabla \tilde f$. We can “tell” the solver, that the provided function and the gradient are defined as the Euclidean variants in the embedding. internally, Manopt.jl then issues the conversion for Euclidean gradients to the corresponding Riemannian one, cf. e.g. this tutorial section or Section 3.8 or more precisely Example 3.62 in [Bou23].

But instead of diving into all the tecnical details, we can just specify objective_type=:Euclidean to trigger the conversion. We start with a simple gradient descent

s = gradient_descent(M, f, grad_f, p0; objective_type=:Euclidean,
    debug = [:Iteration, :Cost, :GradientNorm, 50, "\n"],
    return_state=true,
)
q1 = get_solver_result(s)
s
Initial f(x): -0.000727
# 50    f(x): -0.088415|grad f(p)|:0.004530500043902619
# 100   f(x): -0.089097|grad f(p)|:0.004589417101266096
# 150   f(x): -0.089530|grad f(p)|:0.0026028331895358247
# 200   f(x): -0.089650|grad f(p)|:0.0012359084298719039

# Solver state for `Manopt.jl`s Gradient Descent
After 200 iterations

## Parameters
* retraction method: ExponentialRetraction()

## Stepsize
ArmijoLinesearch() with keyword parameters
  * initial_stepsize    = 1.0
  * retraction_method   = ExponentialRetraction()
  * contraction_factor  = 0.95
  * sufficient_decrease = 0.1

## Stopping criterion

Stop When _one_ of the following are fulfilled:
    Max Iteration 200:  reached
    |grad f| < 1.0e-8: not reached
Overall: reached
This indicates convergence: No

## Debug
    :Iteration = [(:Iteration, "# %-6d"), (:Cost, "f(x): %f"), (:GradientNorm, "|grad f(p)|:%s"), "\n", 50]

From the final cost we can already see that q1 is an eigenvector to the smallest eigenvalue we obtaines above.

And we can compare this to running with the Riemannian gradient, since the RayleighQuotientGrad!! returns this one as well, when just called with the sphere as first Argument, we just have to remove the objective_type.

q2 = gradient_descent(M, f, grad_f, p0;
    debug = [:Iteration, :Cost, :GradientNorm, 50, "\n"],
)
#Test that both are the same
isapprox(M, q1,q2)
Initial f(x): -0.000727
# 50    f(x): -0.088415|grad f(p)|:0.004530500043902567
# 100   f(x): -0.089097|grad f(p)|:0.004589417101266063
# 150   f(x): -0.089530|grad f(p)|:0.002602833189535808
# 200   f(x): -0.089650|grad f(p)|:0.0012359084298719097

true

We can also benchmark both

@benchmark gradient_descent($M, $f, $grad_f, $p0; objective_type=:Euclidean)
BenchmarkTools.Trial: 23 samples with 1 evaluation.
 Range (min … max):  218.625 ms … 280.105 ms  ┊ GC (min … max): 3.46% … 3.93%
 Time  (median):     220.184 ms               ┊ GC (median):    3.49%
 Time  (mean ± σ):   225.401 ms ±  13.335 ms  ┊ GC (mean ± σ):  3.54% ± 0.27%

  ▆█                                                             
  ███▄▁▄▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  219 ms           Histogram: frequency by time          280 ms <

 Memory estimate: 1.13 GiB, allocs estimate: 3607.
@benchmark gradient_descent($M, $f, $grad_f, $p0)
BenchmarkTools.Trial: 155 samples with 1 evaluation.
 Range (min … max):  29.784 ms … 165.193 ms  ┊ GC (min … max): 0.00% … 81.19%
 Time  (median):     30.166 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.288 ms ±  11.060 ms  ┊ GC (mean ± σ):  3.48% ±  7.06%

  █                                                             
  █▄▃▆▃▃▃▃▂▂▁▁▁▁▂▁▁▁▁▃▅▃▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  29.8 ms         Histogram: frequency by time           46 ms <

 Memory estimate: 11.38 MiB, allocs estimate: 3000.

From these results we see, that the conversion from the Euclidean to the Riemannian gradient does require a small amount of effort and hence reduces the performance slighly. Still, if the Euclidean Gradient is easier to compute or already available, this is in terms of coding the faster way. Finally this is a tradeoff between derivation and implementation efforts for the Riemannian gradient and a slight performance reduction when using the Euclidean one.

A Solver based (also) on (approximate) Hessian information

To also involve the Hessian, we consider the trust regions solver with three cases:

  1. Euclidean, approximating the Hessian
  2. Euclidean, providing the Hessian
  3. Riemannian, providing the Hessian but also using in-place evaluations.
q3 = trust_regions(M, f, grad_f, p0; objective_type=:Euclidean,
    debug = [:Iteration, :Cost, :GradientNorm, 10, "\n"],
);
Initial f(x): -0.000727
# 10    f(x): -0.088106|grad f(p)|:0.01903913659588686
# 20    f(x): -0.089023|grad f(p)|:0.007792334296299116
# 30    f(x): -0.089501|grad f(p)|:0.008034300330026467
# 40    f(x): -0.089842|grad f(p)|:0.008125526728200166
# 50    f(x): -0.089890|grad f(p)|:0.0031244752821335416
# 60    f(x): -0.089925|grad f(p)|:0.0029682862637714163
# 70    f(x): -0.089962|grad f(p)|:0.002811722437216778
# 80    f(x): -0.089997|grad f(p)|:0.0026658493010157363
# 90    f(x): -0.090032|grad f(p)|:0.0025418974797659266
# 100   f(x): -0.090067|grad f(p)|:0.0024485809550738955
# 110   f(x): -0.090108|grad f(p)|:0.0023894008071780747
# 120   f(x): -0.090155|grad f(p)|:0.002362317662908117
# 130   f(x): -0.090208|grad f(p)|:0.0023611301647631484
# 140   f(x): -0.090262|grad f(p)|:0.00237797866404072
# 150   f(x): -0.090314|grad f(p)|:0.002405563029627607
# 160   f(x): -0.090362|grad f(p)|:0.002438250821406204
# 170   f(x): -0.090404|grad f(p)|:0.002472221074327323
# 180   f(x): -0.090441|grad f(p)|:0.0025051377726827166
# 190   f(x): -0.090472|grad f(p)|:0.002535721310831389
# 200   f(x): -0.090498|grad f(p)|:0.0025633813700434637
# 210   f(x): -0.090513|grad f(p)|:0.0025832821804127513
# 220   f(x): -0.090513|grad f(p)|:0.0025832821804127513
# 230   f(x): -0.090513|grad f(p)|:0.0025832821792817493
# 240   f(x): -0.090513|grad f(p)|:0.0025832821770197098
# 250   f(x): -0.090513|grad f(p)|:0.0025832821747576924
# 260   f(x): -0.090513|grad f(p)|:0.002583282172495683
# 270   f(x): -0.090513|grad f(p)|:0.0025832821702336567
# 280   f(x): -0.090513|grad f(p)|:0.002583282167971658
# 290   f(x): -0.090513|grad f(p)|:0.002583282165709656
# 300   f(x): -0.090513|grad f(p)|:0.002583282163447637
# 310   f(x): -0.090513|grad f(p)|:0.0025832821611855928
# 320   f(x): -0.090513|grad f(p)|:0.0025832821589235814
# 330   f(x): -0.090513|grad f(p)|:0.002583282156661572
# 340   f(x): -0.090513|grad f(p)|:0.0025832821543995727
# 350   f(x): -0.090513|grad f(p)|:0.002583282152137569
# 360   f(x): -0.090513|grad f(p)|:0.0025832821498755487
# 370   f(x): -0.090513|grad f(p)|:0.0025832821476135036
# 380   f(x): -0.090513|grad f(p)|:0.0025832821453515035
# 390   f(x): -0.090513|grad f(p)|:0.0025832821430894675
# 400   f(x): -0.090513|grad f(p)|:0.0025832821408274405
# 410   f(x): -0.090513|grad f(p)|:0.002583282138565445
# 420   f(x): -0.090513|grad f(p)|:0.002583282136303441
# 430   f(x): -0.090513|grad f(p)|:0.00258328213404143
# 440   f(x): -0.090513|grad f(p)|:0.002583282131779385
# 450   f(x): -0.090513|grad f(p)|:0.0025832821295174104
# 460   f(x): -0.090513|grad f(p)|:0.002583282127255372
# 470   f(x): -0.090513|grad f(p)|:0.002583282124993372
# 480   f(x): -0.090513|grad f(p)|:0.0025832821227313313
# 490   f(x): -0.090513|grad f(p)|:0.0025832821204693065
# 500   f(x): -0.090513|grad f(p)|:0.002583282118207321
# 510   f(x): -0.090513|grad f(p)|:0.0025832821159453034
# 520   f(x): -0.090513|grad f(p)|:0.0025832821136832665
# 530   f(x): -0.090513|grad f(p)|:0.0025832821114212673
# 540   f(x): -0.090513|grad f(p)|:0.002583282109159243
# 550   f(x): -0.090513|grad f(p)|:0.002583282106897217
# 560   f(x): -0.090513|grad f(p)|:0.002583282104635213
# 570   f(x): -0.090513|grad f(p)|:0.0025832821023731955
# 580   f(x): -0.090513|grad f(p)|:0.0025832821001112094
# 590   f(x): -0.090513|grad f(p)|:0.002583282097849167
# 600   f(x): -0.090513|grad f(p)|:0.0025832820955871503
# 610   f(x): -0.090513|grad f(p)|:0.0025832820933251325
# 620   f(x): -0.090513|grad f(p)|:0.002583282091063122
# 630   f(x): -0.090513|grad f(p)|:0.0025832820888010873
# 640   f(x): -0.090513|grad f(p)|:0.0025832820865390785
# 650   f(x): -0.090513|grad f(p)|:0.0025832820842770442
# 660   f(x): -0.090513|grad f(p)|:0.0025832820820150576
# 670   f(x): -0.090513|grad f(p)|:0.0025832820797530767
# 680   f(x): -0.090513|grad f(p)|:0.0025832820774910523
# 690   f(x): -0.090513|grad f(p)|:0.0025832820752290362
# 700   f(x): -0.090513|grad f(p)|:0.002583282072966992
# 710   f(x): -0.090513|grad f(p)|:0.002583282070704973
# 720   f(x): -0.090513|grad f(p)|:0.0025832820684429532
# 730   f(x): -0.090513|grad f(p)|:0.002583282066180946
# 740   f(x): -0.090513|grad f(p)|:0.0025832820639189306
# 750   f(x): -0.090513|grad f(p)|:0.0025832820616569214
# 760   f(x): -0.090513|grad f(p)|:0.00258328205939488
# 770   f(x): -0.090513|grad f(p)|:0.002583282057132884
# 780   f(x): -0.090513|grad f(p)|:0.0025832820548708406
# 790   f(x): -0.090513|grad f(p)|:0.002583282052608873
# 800   f(x): -0.090513|grad f(p)|:0.002583282050346837
# 810   f(x): -0.090513|grad f(p)|:0.0025832820480848214
# 820   f(x): -0.090513|grad f(p)|:0.0025832820458228205
# 830   f(x): -0.090513|grad f(p)|:0.0025832820435608087
# 840   f(x): -0.090513|grad f(p)|:0.0025832820412987944
# 850   f(x): -0.090513|grad f(p)|:0.0025832820390367726
# 860   f(x): -0.090513|grad f(p)|:0.002583282036774768
# 870   f(x): -0.090513|grad f(p)|:0.002583282034512706
# 880   f(x): -0.090513|grad f(p)|:0.002583282032250709
# 890   f(x): -0.090513|grad f(p)|:0.0025832820299886896
# 900   f(x): -0.090513|grad f(p)|:0.002583282027726701
# 910   f(x): -0.090513|grad f(p)|:0.002583282025464682
# 920   f(x): -0.090513|grad f(p)|:0.0025832820232026517
# 930   f(x): -0.090513|grad f(p)|:0.002583282020940619
# 940   f(x): -0.090513|grad f(p)|:0.0025832820186786334
# 950   f(x): -0.090513|grad f(p)|:0.002583282016416595
# 960   f(x): -0.090513|grad f(p)|:0.0025832820141545986
# 970   f(x): -0.090513|grad f(p)|:0.0025832820118925717
# 980   f(x): -0.090513|grad f(p)|:0.0025832820096305525
# 990   f(x): -0.090513|grad f(p)|:0.002583282007368543
# 1000  f(x): -0.090513|grad f(p)|:0.0025832820051065217

To provide the Hessian in the high-level interface we need to prodive it as an anonymous function, since any struct is considered to (eventually) be the also optional starting point. For space reasons, let’s also shorten the debug print to only iterations 7 and 14.

q4 = trust_regions(M, f, grad_f, (E, p, X) -> Hess_f(E, p, X), p0; objective_type=:Euclidean,
    debug = [:Iteration, :Cost, :GradientNorm, 10, "\n"],
);
Initial f(x): -0.000727
# 10    f(x): -0.089673|grad f(p)|:0.0033633987039373655
q5 = trust_regions(M, f, grad_f, (M, Y, p, X) -> Hess_f(M, Y, p, X), p0;
    evaluation=InplaceEvaluation(),
    debug = [:Iteration, :Cost, :GradientNorm, 10, "\n"],
);
Initial f(x): -0.000727
# 10    f(x): -0.089673|grad f(p)|:0.00336339870393737

Let’s also here compare them in benchmarks. Let’s here compare all variants in their (more performant) in-place versions.

@benchmark trust_regions($M, $f, $grad_f, $p0;
  objective_type=:Euclidean,
  evaluation=InplaceEvaluation(),
)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  491.065 ms … 539.773 ms  ┊ GC (min … max): 3.15% … 3.49%
 Time  (median):     501.502 ms               ┊ GC (median):    3.09%
 Time  (mean ± σ):   502.975 ms ±  14.496 ms  ┊ GC (mean ± σ):  3.15% ± 0.13%

  █▁▁       ▁    █    █                                       ▁  
  ███▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  491 ms           Histogram: frequency by time          540 ms <

 Memory estimate: 1.97 GiB, allocs estimate: 29451.
@benchmark trust_regions($M, $f, $grad_f, $((E, Y, p, X) -> Hess_f(E, Y, p, X)), $p0;
  evaluation=InplaceEvaluation(),
  objective_type=:Euclidean
)
BenchmarkTools.Trial: 329 samples with 1 evaluation.
 Range (min … max):  12.743 ms … 155.346 ms  ┊ GC (min … max): 0.00% … 87.86%
 Time  (median):     14.227 ms               ┊ GC (median):    3.43%
 Time  (mean ± σ):   15.215 ms ±   8.152 ms  ┊ GC (mean ± σ):  6.37% ±  5.79%

  ▇  █▇▇ ▅▅ ▇                                                   
  █▇▁███▁██▇█▇▄▁▄▁▁▆▁▁▁▁▄▁▆▁▁▁▁▁▆▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▆▄▄ ▇
  12.7 ms       Histogram: log(frequency) by time      29.9 ms <

 Memory estimate: 37.42 MiB, allocs estimate: 2788.
@benchmark trust_regions($M, $f, $grad_f, $((M, Y, p, X) -> Hess_f(M, Y, p, X)), $p0;
    evaluation=InplaceEvaluation(),
)
BenchmarkTools.Trial: 541 samples with 1 evaluation.
 Range (min … max):  8.376 ms … 146.597 ms  ┊ GC (min … max): 0.00% … 92.39%
 Time  (median):     8.558 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.249 ms ±   6.065 ms  ┊ GC (mean ± σ):  4.63% ±  5.52%

  ▅█▆▂     ▄▄▃                                                 
  ████▅▅▆▄█████▄▄▁▄▄▄▁▄▁▁▄▁▄▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇▆▅▄▄▁▁▄ ▇
  8.38 ms      Histogram: log(frequency) by time      13.7 ms <

 Memory estimate: 10.75 MiB, allocs estimate: 2767.

We see that Hessian approximation is quite costly, and Gradient and Hessian conversion somewhat costly; still, they also might serve as a good starting point, before deciding to delve into computing Riemannian gradients and Hessians.

Of course all 5 runs obtained solutions close by; one might consider the gradient based runs to not have fully converged.

[distance(M, q1, q) for q ∈ [q2,q3] ]
2-element Vector{Float64}:
 4.471485799821605e-15
 0.048047538209352994
[distance(M, q3, q) for q ∈ [q4,q5] ]
2-element Vector{Float64}:
 0.08269488012454579
 0.08269488012454579

Which we can also see in the final cost, comparing it to the Eigenvalue

[f(M, q) - λ for q ∈ [q1, q2, q3, q4, q5] ]
5-element Vector{Float64}:
  2.76900562450888e-5
  2.769005624428389e-5
 -0.000836208332542443
  3.191891195797325e-16
  3.191891195797325e-16

Summary

We illustrated several possibilities to call solvers, with both Euclidean gradient and Hessian and Riemannian gradient and Hessian, allocating and in-place function. While the performance is better for the Riemannian case, the Euclidean one is a worthy alternative, when those are easier to compute.

Literature