The Rayleigh Quotient

Ronny Bergmann 2024-03-09

Introduction

This example reproduces a few conceptual ideas of Optimization on Manifolds that are used throughout [Bou23] using the Rayleigh quotient and explores several different ways to use the algorithms from Manopt.jl.

For a symmetric matrix $A \in \mathbb R^{n\times n}$ we consider the minimization of the 📖 Rayleigh Quotient

\[\operatorname*{arg\,min}_{x \in \mathbb R^n \backslash \{0\}} \frac{x^{\mathrm{T}}Ax}{\lVert x \rVert^2}.\]

On the sphere, where the denominator is equal to one, we can omit it and obtain

\[f(p) = p^{\mathrm{T}}Ap,\qquad p ∈ 𝕊^{n-1},\]

which we can in turn extend to the embedding as

\[\tilde f(x) = x^{\mathrm{T}}Ax,\qquad x \in \mathbb R^n.\]

This cost has the nice feature that at a minimizer $p^*\in\mathbb S^{n-1}$ the function value $f(p^*)$ is the smallest eigenvalue of $A$: for a unit-norm eigenvector $v$ with $Av = \lambda v$ we have $f(v) = v^{\mathrm{T}}Av = \lambda$, and minimizing over the whole sphere yields the smallest such value.

For the embedded function $\tilde f$ the gradient and Hessian can be computed with classical methods as

\[\begin{align*} ∇\tilde f(x) &= 2Ax, \qquad x ∈ ℝ^n, \\ ∇^2\tilde f(x)[V] &= 2AV, \qquad x, V ∈ ℝ^n. \end{align*}\]
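These follow, for example, from expanding $\tilde f$ along a direction $V$ and using the symmetry of $A$:

\[\tilde f(x + tV) = x^{\mathrm{T}}Ax + 2t\,x^{\mathrm{T}}AV + t^2\,V^{\mathrm{T}}AV, \qquad\text{hence}\qquad \mathrm{D}\tilde f(x)[V] = 2x^{\mathrm{T}}AV = \langle 2Ax, V\rangle.\]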

Similarly, cf. Examples 3.62 and 5.27 of [Bou23], the Riemannian gradient and Hessian on the manifold $\mathcal M = \mathbb S^{n-1}$ are given by

\[\begin{align*} \operatorname{grad} f(p) &= 2Ap - 2(p^{\mathrm{T}}Ap)p,\qquad p ∈ 𝕊^{n-1}, \\ \operatorname{Hess} f(p)[X] &= 2AX - 2(p^{\mathrm{T}}AX)p - 2(p^{\mathrm{T}}Ap)X,\qquad p ∈ 𝕊^{n-1}, X \in T_p𝕊^{n-1}. \end{align*}\]
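Both can be obtained from their Euclidean counterparts above by orthogonal projection onto the tangent space $T_p𝕊^{n-1}$, plus a curvature correction term in the case of the Hessian:

\[\operatorname{grad} f(p) = (I - pp^{\mathrm{T}})∇\tilde f(p), \qquad \operatorname{Hess} f(p)[X] = (I - pp^{\mathrm{T}})∇^2\tilde f(p)[X] - \bigl(p^{\mathrm{T}}∇\tilde f(p)\bigr)X.\]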

Let’s first generate an example matrix $A$.

using Pkg;
cd(@__DIR__)
Pkg.activate("."); # use the example environment,
using LRUCache, BenchmarkTools, LinearAlgebra, Manifolds, ManoptExamples, Manopt, Random
Random.seed!(42)
n = 500
A = Symmetric(randn(n, n) / n)

And the manifolds we work on

M = Sphere(n-1)
Sphere(499, ℝ)
E = get_embedding(M)
Euclidean(500; field=ℝ)

Set up the corresponding functions

Since RayleighQuotientCost, RayleighQuotientGrad!!, and RayleighQuotientHess!! are themselves manifold agnostic, we only need to initialize them once. Agnostic here means that they compute $f$ if called with M as their first argument and $\tilde f$ if called with E.

We instantiate

f = ManoptExamples.RayleighQuotientCost(A)
grad_f = ManoptExamples.RayleighQuotientGrad!!(A)
Hess_f = ManoptExamples.RayleighQuotientHess!!(A)
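For example, the same cost functor evaluates either variant, depending on the manifold passed as its first argument. A small sketch, where x is just some arbitrary nonzero vector in the embedding:

x = randn(n)
f(E, x)             # evaluates the embedded cost f̃ at x
f(M, x ./ norm(x))  # evaluates f on the sphere, at the normalized point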

The suffix !! indicates that these functions work both as allocating and as in-place variants. Given a starting point and some memory

p0 = [1.0, zeros(n-1)...]
X = zero_vector(M, p0)

we can both call

Y = grad_f(M, p0)  # Allocates memory
grad_f(M, X, p0)    # Computes in place of X and returns the result in X.
norm(M, p0, X-Y)
0.0
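The Hessian functor follows the same convention; a minimal sketch using the signatures that also appear further below:

Z = Hess_f(M, p0, X)   # allocating: returns Hess f(p0)[X] as a new tangent vector
W = zero_vector(M, p0)
Hess_f(M, W, p0, X)    # in-place: stores the result in W
norm(M, p0, Z - W)     # should be (numerically) zero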

Now we can use a few different solvers to approach this problem, and this tutorial will walk you through a few of them.

First of all, let’s compute the actual result for comparison. Since Rayleigh quotient minimization is not necessarily the best way to find the smallest eigenvalue, we can also compute it directly as

λ = min(eigvals(A)...)
-0.08924035897103727
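As a side remark, since A is symmetric, the same value can also be obtained directly with eigmin from LinearAlgebra:

eigmin(A)  # the same smallest eigenvalue as λ above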

A Solver based on gradient information

Let’s first use just first-order information, and since we are only getting started, assume we have only derived the Euclidean gradient $\nabla \tilde f$. We can “tell” the solver that the provided cost and gradient are defined as the Euclidean variants in the embedding. Internally, Manopt.jl then converts the Euclidean gradient into the corresponding Riemannian one, cf. e.g. this tutorial section or Section 3.8, more precisely Example 3.62, of [Bou23].
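For the sphere, seen as a Riemannian submanifold of $\mathbb R^n$, this conversion essentially amounts to projecting the Euclidean gradient onto the tangent space. A minimal sketch of what happens internally, using project from Manifolds.jl and the manifold-agnostic functors from above:

X_e = grad_f(E, p0)               # the Euclidean gradient at p0, computed in the embedding
X_r = project(M, p0, X_e)         # project onto the tangent space T_{p0}𝕊^{n-1}
norm(M, p0, X_r - grad_f(M, p0))  # should be (numerically) zero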

But instead of diving into all the technical details, we can simply specify objective_type=:Euclidean to trigger the conversion automatically. We start with a simple gradient descent

s = gradient_descent(M, f, grad_f, p0; objective_type=:Euclidean,
    debug = [:Iteration, :Cost, :GradientNorm, 50, "\n"],
    return_state=true,
)
q1 = get_solver_result(s)
s
Initial f(x): -0.000727
# 50    f(x): -0.088242|grad f(p)|:0.003870474326981599
# 100   f(x): -0.088680|grad f(p)|:0.0034956568288634616
# 150   f(x): -0.089026|grad f(p)|:0.0026514781676923237
# 200   f(x): -0.089178|grad f(p)|:0.001531160335922979

# Solver state for `Manopt.jl`s Gradient Descent
After 200 iterations

## Parameters
* retraction method: ExponentialRetraction()

## Stepsize
ArmijoLinesearch(;
    initial_stepsize=1.0
    retraction_method=ExponentialRetraction()
    contraction_factor=0.95
    sufficient_decrease=0.1
)

## Stopping criterion

Stop When _one_ of the following are fulfilled:
  * Max Iteration 200:  reached
  * |grad f| < 1.0e-8: not reached
Overall: reached
This indicates convergence: No

## Debug
    :Iteration = [(:Iteration, "# %-6d"), (:Cost, "f(x): %f"), (:GradientNorm, "|grad f(p)|:%s"), "\n", 50]

From the final cost we can already see that q1 is (approximately) an eigenvector for the smallest eigenvalue we obtained above.

We can compare this to running with the Riemannian gradient: since RayleighQuotientGrad!! also returns the Riemannian gradient when called with the sphere as its first argument, we only have to remove the objective_type keyword.

q2 = gradient_descent(M, f, grad_f, p0;
    debug = [:Iteration, :Cost, :GradientNorm, 50, "\n"],
)
# Test that both are the same
isapprox(M, q1,q2)
Initial f(x): -0.000727
# 50    f(x): -0.088242|grad f(p)|:0.0038704743269815734
# 100   f(x): -0.088680|grad f(p)|:0.0034956568288633714
# 150   f(x): -0.089026|grad f(p)|:0.002651478167692353
# 200   f(x): -0.089178|grad f(p)|:0.0015311603359229782

true

We can also benchmark both

@benchmark gradient_descent($M, $f, $grad_f, $p0; objective_type=:Euclidean)
BenchmarkTools.Trial: 15 samples with 1 evaluation per sample.
 Range (min … max):  313.438 ms … 365.884 ms  ┊ GC (min … max): 10.02% … 4.56%
 Time  (median):     341.265 ms               ┊ GC (median):    10.18%
 Time  (mean ± σ):   338.854 ms ±  15.925 ms  ┊ GC (mean ± σ):   9.95% ± 1.56%

  ▁    ▁  ▁ ▁      ▁      █       ▁▁   ▁    ▁▁  ▁        ▁    ▁  
  █▁▁▁▁█▁▁█▁█▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁█▁▁▁▁██▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁█ ▁
  313 ms           Histogram: frequency by time          366 ms <

 Memory estimate: 1.13 GiB, allocs estimate: 7964.
@benchmark gradient_descent($M, $f, $grad_f, $p0)
BenchmarkTools.Trial: 161 samples with 1 evaluation per sample.
 Range (min … max):  28.766 ms … 40.709 ms  ┊ GC (min … max): 0.00% … 19.53%
 Time  (median):     30.280 ms              ┊ GC (median):    2.78%
 Time  (mean ± σ):   31.098 ms ±  2.404 ms  ┊ GC (mean ± σ):  2.83% ±  3.53%

    ▇   ▃▄█▂                                                   
  ▃▅██▃▄████▆▅▇▇▃▃▃▃▁▁▁▁▃▁▄▁▁▃▁▁▁▁▃▁▃▆▆▁▃▁▁▁▁▁▃▁▁▃▃▁▁▁▃▁▁▁▃▁▃ ▃
  28.8 ms         Histogram: frequency by time        39.4 ms <

 Memory estimate: 11.20 MiB, allocs estimate: 8750.

From these results we see that the conversion from the Euclidean to the Riemannian gradient does require a small amount of extra work and hence reduces the performance slightly. Still, if the Euclidean gradient is easier to compute or already available, it is the faster route in terms of implementation effort. In the end this is a tradeoff between the effort of deriving and implementing the Riemannian gradient and a slight performance penalty when using the Euclidean one.

A Solver based (also) on (approximate) Hessian information

To also involve the Hessian, we consider the trust regions solver with three cases:

  1. Euclidean, approximating the Hessian
  2. Euclidean, providing the Hessian
  3. Riemannian, providing the Hessian but also using in-place evaluations.
q3 = trust_regions(M, f, grad_f, p0; objective_type=:Euclidean,
    debug = [:Iteration, :Cost, :GradientNorm, 10, "\n"],
);
Initial f(x): -0.000727
# 10    f(x): -0.088412|grad f(p)|:0.020989167846023046
# 20    f(x): -0.089079|grad f(p)|:0.007420373217153763
# 30    f(x): -0.089095|grad f(p)|:0.0022962557538450884
# 40    f(x): -0.089095|grad f(p)|:0.0022962557538450884
# 50    f(x): -0.089095|grad f(p)|:0.002296255752372694
# 60    f(x): -0.089095|grad f(p)|:0.002296255750900326
# 70    f(x): -0.089095|grad f(p)|:0.002296255749427973
# 80    f(x): -0.089095|grad f(p)|:0.0022962557479555895
# 90    f(x): -0.089095|grad f(p)|:0.002296255746483256
# 100   f(x): -0.089095|grad f(p)|:0.002296255745010824
# 110   f(x): -0.089095|grad f(p)|:0.0022962557435384336
# 120   f(x): -0.089095|grad f(p)|:0.0022962557420660814
# 130   f(x): -0.089095|grad f(p)|:0.0022962557405937175
# 140   f(x): -0.089095|grad f(p)|:0.0022962557391213406
# 150   f(x): -0.089095|grad f(p)|:0.0022962557376489472
# 160   f(x): -0.089095|grad f(p)|:0.0022962557361765603
# 170   f(x): -0.089095|grad f(p)|:0.0022962557347042086
# 180   f(x): -0.089095|grad f(p)|:0.0022962557332318594
# 190   f(x): -0.089095|grad f(p)|:0.002296255731759444
# 200   f(x): -0.089095|grad f(p)|:0.0022962557302870605
# 210   f(x): -0.089095|grad f(p)|:0.002296255728814736
# 220   f(x): -0.089095|grad f(p)|:0.0022962557273423162
# 230   f(x): -0.089095|grad f(p)|:0.002296255725869944
# 240   f(x): -0.089095|grad f(p)|:0.002296255724397584
# 250   f(x): -0.089095|grad f(p)|:0.0022962557229252085
# 260   f(x): -0.089095|grad f(p)|:0.0022962557214528346
# 270   f(x): -0.089095|grad f(p)|:0.0022962557199804855
# 280   f(x): -0.089095|grad f(p)|:0.0022962557185080726
# 290   f(x): -0.089095|grad f(p)|:0.002296255717035685
# 300   f(x): -0.089095|grad f(p)|:0.0022962557155633144
# 310   f(x): -0.089095|grad f(p)|:0.0022962557140909605
# 320   f(x): -0.089095|grad f(p)|:0.002296255712618571
# 330   f(x): -0.089095|grad f(p)|:0.0022962557111461876
# 340   f(x): -0.089095|grad f(p)|:0.002296255709673808
# 350   f(x): -0.089095|grad f(p)|:0.0022962557082014416
# 360   f(x): -0.089095|grad f(p)|:0.00229625570672907
# 370   f(x): -0.089095|grad f(p)|:0.002296255705256684
# 380   f(x): -0.089095|grad f(p)|:0.0022962557037843165
# 390   f(x): -0.089095|grad f(p)|:0.002296255702311948
# 400   f(x): -0.089095|grad f(p)|:0.0022962557008395722
# 410   f(x): -0.089095|grad f(p)|:0.002296255699367198
# 420   f(x): -0.089095|grad f(p)|:0.0022962556978947854
# 430   f(x): -0.089095|grad f(p)|:0.002296255696422434
# 440   f(x): -0.089095|grad f(p)|:0.0022962556949500715
# 450   f(x): -0.089095|grad f(p)|:0.002296255693477701
# 460   f(x): -0.089095|grad f(p)|:0.0022962556920053264
# 470   f(x): -0.089095|grad f(p)|:0.002296255690532953
# 480   f(x): -0.089095|grad f(p)|:0.0022962556890605674
# 490   f(x): -0.089095|grad f(p)|:0.002296255687588196
# 500   f(x): -0.089095|grad f(p)|:0.002296255686115812
# 510   f(x): -0.089095|grad f(p)|:0.0022962556846434557
# 520   f(x): -0.089095|grad f(p)|:0.00229625568317109
# 530   f(x): -0.089095|grad f(p)|:0.0022962556816987227
# 540   f(x): -0.089095|grad f(p)|:0.002296255680226299
# 550   f(x): -0.089095|grad f(p)|:0.0022962556787539364
# 560   f(x): -0.089095|grad f(p)|:0.0022962556772815833
# 570   f(x): -0.089095|grad f(p)|:0.0022962556758091917
# 580   f(x): -0.089095|grad f(p)|:0.0022962556743368473
# 590   f(x): -0.089095|grad f(p)|:0.0022962556728644322
# 600   f(x): -0.089095|grad f(p)|:0.002296255671392082
# 610   f(x): -0.089095|grad f(p)|:0.0022962556699197066
# 620   f(x): -0.089095|grad f(p)|:0.002296255668447314
# 630   f(x): -0.089095|grad f(p)|:0.002296255666974941
# 640   f(x): -0.089095|grad f(p)|:0.002296255665502539
# 650   f(x): -0.089095|grad f(p)|:0.002296255664030171
# 660   f(x): -0.089095|grad f(p)|:0.002296255662557798
# 670   f(x): -0.089095|grad f(p)|:0.0022962556610854248
# 680   f(x): -0.089095|grad f(p)|:0.002296255659613032
# 690   f(x): -0.089095|grad f(p)|:0.0022962556581407082
# 700   f(x): -0.089095|grad f(p)|:0.002296255656668318
# 710   f(x): -0.089095|grad f(p)|:0.0022962556551958985
# 720   f(x): -0.089095|grad f(p)|:0.002296255653723546
# 730   f(x): -0.089095|grad f(p)|:0.002296255652251183
# 740   f(x): -0.089095|grad f(p)|:0.0022962556507788168
# 750   f(x): -0.089095|grad f(p)|:0.0022962556493064516
# 760   f(x): -0.089095|grad f(p)|:0.002296255647834051
# 770   f(x): -0.089095|grad f(p)|:0.002296255646361663
# 780   f(x): -0.089095|grad f(p)|:0.0022962556448893287
# 790   f(x): -0.089095|grad f(p)|:0.0022962556434169014
# 800   f(x): -0.089095|grad f(p)|:0.0022962556419445497
# 810   f(x): -0.089095|grad f(p)|:0.0022962556404721797
# 820   f(x): -0.089095|grad f(p)|:0.0022962556389998293
# 830   f(x): -0.089095|grad f(p)|:0.002296255637527417
# 840   f(x): -0.089095|grad f(p)|:0.0022962556360550538
# 850   f(x): -0.089095|grad f(p)|:0.00229625563458268
# 860   f(x): -0.089095|grad f(p)|:0.002296255633110347
# 870   f(x): -0.089095|grad f(p)|:0.0022962556316379126
# 880   f(x): -0.089095|grad f(p)|:0.0022962556301655314
# 890   f(x): -0.089095|grad f(p)|:0.0022962556286931957
# 900   f(x): -0.089095|grad f(p)|:0.002296255627220805
# 910   f(x): -0.089095|grad f(p)|:0.0022962556257484328
# 920   f(x): -0.089095|grad f(p)|:0.0022962556242760554
# 930   f(x): -0.089095|grad f(p)|:0.0022962556228036954
# 940   f(x): -0.089095|grad f(p)|:0.002296255621331309
# 950   f(x): -0.089095|grad f(p)|:0.0022962556198589204
# 960   f(x): -0.089095|grad f(p)|:0.0022962556183865473
# 970   f(x): -0.089095|grad f(p)|:0.002296255616914149
# 980   f(x): -0.089095|grad f(p)|:0.0022962556154418005
# 990   f(x): -0.089095|grad f(p)|:0.0022962556139694153
# 1000  f(x): -0.089095|grad f(p)|:0.0022962556124970193

To provide the Hessian in the high-level interface we need to pass it as an anonymous function, since any struct passed in that position would otherwise be considered to be the (optional) starting point. We keep the debug print at every tenth iteration; since these runs terminate much earlier than the previous one, only a single line is shown.

q4 = trust_regions(M, f, grad_f, (E, p, X) -> Hess_f(E, p, X), p0; objective_type=:Euclidean,
    debug = [:Iteration, :Cost, :GradientNorm, 10, "\n"],
);
Initial f(x): -0.000727
# 10    f(x): -0.089234|grad f(p)|:0.0013561755210368023
q5 = trust_regions(M, f, grad_f, (M, Y, p, X) -> Hess_f(M, Y, p, X), p0;
    evaluation=InplaceEvaluation(),
    debug = [:Iteration, :Cost, :GradientNorm, 10, "\n"],
);
Initial f(x): -0.000727
# 10    f(x): -0.089234|grad f(p)|:0.0013561755210368012

Let’s also compare these in benchmarks, here using all variants in their (more performant) in-place versions.

@benchmark trust_regions($M, $f, $grad_f, $p0;
  objective_type=:Euclidean,
  evaluation=InplaceEvaluation(),
)
BenchmarkTools.Trial: 7 samples with 1 evaluation per sample.
 Range (min … max):  750.045 ms … 806.410 ms  ┊ GC (min … max):  7.09% … 10.98%
 Time  (median):     782.691 ms               ┊ GC (median):    10.68%
 Time  (mean ± σ):   781.670 ms ±  19.942 ms  ┊ GC (mean ± σ):  10.21% ±  1.39%

  █           █                   █  █          █    █        █  
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁█ ▁
  750 ms           Histogram: frequency by time          806 ms <

 Memory estimate: 1.96 GiB, allocs estimate: 75417.
@benchmark trust_regions($M, $f, $grad_f, $((E, Y, p, X) -> Hess_f(E, Y, p, X)), $p0;
  evaluation=InplaceEvaluation(),
  objective_type=:Euclidean
)
BenchmarkTools.Trial: 217 samples with 1 evaluation per sample.
 Range (min … max):  18.994 ms … 42.563 ms  ┊ GC (min … max):  0.00% … 8.23%
 Time  (median):     22.067 ms              ┊ GC (median):    10.55%
 Time  (mean ± σ):   23.041 ms ±  3.516 ms  ┊ GC (mean ± σ):   9.60% ± 3.83%

  ▁   ▂▂▄██▅▂▁▃  ▂                                             
  █▇▅▆██████████▇█▇▆▁▁▆▁▁▁▆▅▅▁▁▁▁▆▁▁▁▁▁▁▁▅▁▁▅▁▅▅▁▁▁▁▆▁▁▁▁▁▁▁▅ ▆
  19 ms        Histogram: log(frequency) by time      40.6 ms <

 Memory estimate: 44.92 MiB, allocs estimate: 12651.
@benchmark trust_regions($M, $f, $grad_f, $((M, Y, p, X) -> Hess_f(M, Y, p, X)), $p0;
    evaluation=InplaceEvaluation(),
)
BenchmarkTools.Trial: 336 samples with 1 evaluation per sample.
 Range (min … max):  12.458 ms … 46.158 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.305 ms              ┊ GC (median):    9.22%
 Time  (mean ± σ):   14.887 ms ±  3.059 ms  ┊ GC (mean ± σ):  7.51% ± 6.54%

  ▂▄▃   ██▂▆▁                                                  
  ███▆▂▄█████▅▆▅▃▃▃▁▁▂▂▁▁▁▁▁▁▁▃▅▃▂▃▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▂▁▃▁▂▁▁▂ ▃
  12.5 ms         Histogram: frequency by time        27.3 ms <

 Memory estimate: 16.35 MiB, allocs estimate: 12633.

We see that approximating the Hessian is quite costly and that converting the Euclidean gradient and Hessian adds some overhead as well; still, both might serve as a good starting point before deciding to delve into deriving Riemannian gradients and Hessians.

Of course all 5 runs obtained approximate minimizers, though the resulting points do not all coincide; one might consider the gradient-based runs to not have fully converged.

[distance(M, q1, q) for q ∈ [q2,q3] ]
2-element Vector{Float64}:
 8.835276992173579e-16
 0.8953986994946139
[distance(M, q3, q) for q ∈ [q4,q5] ]
2-element Vector{Float64}:
 1.1159334410358788
 1.1159334410358788

This is also reflected in the final cost values, which we can compare to the smallest eigenvalue computed above

[f(M, q) - λ for q ∈ [q1, q2, q3, q4, q5] ]
5-element Vector{Float64}:
  6.211293387550776e-5
  6.211293387559103e-5
  0.0001451688142172225
 -5.551115123125783e-17
 -5.551115123125783e-17
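If one wanted the gradient-based runs to converge further, one could for example allow more iterations and a tighter tolerance by adjusting the stopping criterion, which above was reached after the default 200 iterations. A sketch, not run here, using Manopt.jl’s stopping criteria:

q1b = gradient_descent(M, f, grad_f, p0;
    objective_type=:Euclidean,
    stopping_criterion=StopAfterIteration(2000) | StopWhenGradientNormLess(1e-9),
)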

Summary

We illustrated several possibilities to call solvers, using both the Euclidean gradient and Hessian and the Riemannian gradient and Hessian, with allocating as well as in-place functions. While the performance is better for the Riemannian case, the Euclidean variants are a worthy alternative whenever those are easier to compute.

Literature

[Bou23] N. Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023.