Commit 955a9c0

Merge branch 'main' of github.com:modula-systems/modula
2 parents 559acb1 + c971d1e commit 955a9c0

2 files changed: +35 -17 lines changed

Lines changed: 34 additions & 16 deletions
@@ -1,22 +1,40 @@
 Newton-Schulz
 ==============
 
-.. admonition:: Warning
-   :class: warning
+On this page, we will work out a family of iterative algorithms for "orthogonalizing a matrix", by which we mean transforming either the rows or the columns of the matrix to form an orthonormal set of vectors.
+In particular, we will consider the map that sends a matrix :math:`M\in\mathbb{R}^{m\times n}` with reduced SVD :math:`M = U \Sigma V^\top` to the matrix :math:`U V^\top`. This operation can be thought of as "snapping the singular values of :math:`M` to one"---although the iterations we consider will actually fix zero singular values at zero. We will refer to the orthogonalized matrix corresponding to :math:`M` as :math:`M^\sharp`---pronounced "M sharp"---so that:
+
+.. math::
+   M = U \Sigma V^\top \mapsto M^\sharp = U V^\top.
+
+This "sharp operation" is sometimes referred to as `"symmetric orthogonalization" <https://en.wikipedia.org/wiki/Orthogonalization>`_ because no row or column of the matrix :math:`M` is treated as special in the procedure. This is in contrast to `Gram-Schmidt orthogonalization <https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process>`_, which involves first picking out a certain row or column vector as special and then orthogonalizing the remaining vectors against this vector.
 
-   This page is still under construction.
 
-History of orthogonalization
-----------------------------
+Steepest descent under the spectral norm
+-----------------------------------------
+
+The reason we care about orthogonalization and the sharp operator in the context of neural network optimization is that it is an essential primitive for solving the problem of "steepest descent under the spectral norm". For a matrix :math:`G\in\mathbb{R}^{m\times n}` thought of as the gradient of a loss function, the sharp operator solves the following problem:
+
+.. math::
+   G^\sharp = \operatorname{arg max}_{\Delta W \in \mathbb{R}^{m\times n} \,:\, \|\Delta W\|_* \leq 1} \langle G , \Delta W \rangle,
 
-- procrustes problem
-- loewdin symmetrization
-- sharp-operator: frank-wolfe? nesterov?
-- neural nets: carlin
+where :math:`\langle \cdot, \cdot \rangle` denotes the Frobenius inner product and :math:`\|\cdot\|_*` denotes the spectral norm. In words, the sharp operator tells us the direction :math:`\Delta W` in matrix space that squeezes out the most linearized change in loss :math:`\langle G, \Delta W \rangle` while keeping the spectral norm under control. Keeping the spectral norm of the weight update under control is important as it allows us to guarantee that the features of the model change by a controlled amount.
+
+Historical connections
+-----------------------
+
+The procedure of symmetric orthogonalization appears in a number of different contexts:
+
+- it is used to solve the `orthogonal Procrustes problem <https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem>`_.
+- it is used to compute the "orthogonal polar factor" in the `polar decomposition <https://en.wikipedia.org/wiki/Polar_decomposition>`_ of a matrix.
+- it was used by `Per-Olov Löwdin <https://en.wikipedia.org/wiki/Per-Olov_L%C3%B6wdin>`_ in the 1950s to perform atomic and molecular orbital calculations.
+- it is used in `Frank-Wolfe optimization <https://proceedings.mlr.press/v28/jaggi13>`_ over the spectral norm ball.
+- `Preconditioned Spectral Descent for Deep Learning <https://papers.nips.cc/paper_files/paper/2015/hash/f50a6c02a3fc5a3a5d4d9391f05f3efc-Abstract.html>`_.
+
+Newton-Schulz iteration
 
 - kovarik, bjorck and bowie
 - higham: newton-schulz
-
 - anil and grosse: for weights not updates
 
 Polynomial iterations
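As a rough illustration of the definitions added in the hunk above, the following NumPy sketch computes the sharp of a matrix directly from its reduced SVD and numerically checks the steepest-descent characterization. The helper names `sharp` and `frobenius_inner`, the cutoff `tol`, and the random test matrices are assumptions made for this example and do not come from the commit. The page itself is concerned with iterative methods rather than an explicit SVD.

```python
import numpy as np


def sharp(M, tol=1e-12):
    """Return U V^T for the reduced SVD M = U diag(S) V^T.

    Singular directions whose singular value is (numerically) zero are
    dropped, so zero singular values stay at zero. `tol` is a hypothetical
    relative cutoff chosen for this sketch.
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    cutoff = tol * (S.max() if S.size else 0.0)
    keep = S > cutoff
    return U[:, keep] @ Vt[keep, :]


def frobenius_inner(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return float(np.sum(A * B))


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))  # stand-in for a gradient matrix
G_sharp = sharp(G)

# G is wide (m <= n) and full rank here, so the rows of G_sharp are orthonormal.
rows_orthonormal = np.allclose(G_sharp @ G_sharp.T, np.eye(4))

# <G, G_sharp> equals the sum of the singular values of G.
best = frobenius_inner(G, G_sharp)
sum_singular_values = float(np.linalg.svd(G, compute_uv=False).sum())

# Random candidates rescaled to unit spectral norm should never beat G_sharp.
gains = []
for D in rng.standard_normal((1000, 4, 6)):
    D = D / np.linalg.norm(D, ord=2)  # ord=2 is the largest singular value
    gains.append(frobenius_inner(G, D))
max_random_gain = max(gains)

print(rows_orthonormal, best, sum_singular_values, max_random_gain)
```

The random search is only a sanity check, not a proof: it shows that no sampled update with unit spectral norm achieves a larger linearized change than the sharp of the gradient.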
@@ -30,8 +48,8 @@ A cubic iteration
 
 .. raw:: html
 
-   <iframe src="https://www.desmos.com/calculator/qzvof94grn?embed" width="48%" height="300px" frameborder="0"></iframe>&nbsp;&nbsp;&nbsp;
-   <iframe src="https://www.desmos.com/calculator/2d0ekimums?embed" width="48%" height="300px" frameborder="0"></iframe>
+   <iframe src="https://www.desmos.com/calculator/qzvof94grn?embed" width="47%" height="300px" frameborder="0" style="margin-right: 4%"></iframe>
+   <iframe src="https://www.desmos.com/calculator/2d0ekimums?embed" width="47%" height="300px" frameborder="0"></iframe>
 
 some more text
 
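The body of the "A cubic iteration" section is not shown in this diff, so the following is only a sketch of the classical cubic Newton-Schulz iteration, X <- (3/2) X - (1/2) X X^T X, which may differ from the coefficients the page actually uses. The function name `newton_schulz_cubic`, the Frobenius-norm pre-scaling, and the default step count are assumptions made for this example.

```python
import numpy as np


def newton_schulz_cubic(M, steps=30):
    """Approximate U V^T for M = U diag(S) V^T without computing an SVD.

    Each step applies s -> 1.5 * s - 0.5 * s**3 to every singular value.
    This map pushes values in (0, 1] towards 1 and leaves 0 fixed at 0.
    `steps` is a hypothetical default chosen for this sketch.
    """
    norm = np.linalg.norm(M)  # Frobenius norm for a 2-D array
    if norm == 0.0:
        return M.copy()
    # Dividing by the Frobenius norm puts every singular value in (0, 1].
    X = M / norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))
X = newton_schulz_cubic(M)

# Compare against the SVD-based definition of the sharp operation.
U, _, Vt = np.linalg.svd(M, full_matrices=False)
max_abs_error = float(np.max(np.abs(X - U @ Vt)))
print(max_abs_error)
```

Small singular values grow by a factor of roughly 1.5 per step at first, so badly conditioned inputs need more steps than this well-conditioned random example.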
@@ -43,8 +61,8 @@ A quintic iteration
 
 .. raw:: html
 
-   <iframe src="https://www.desmos.com/calculator/fjjjpsnl2g?embed" width="48%" height="300px" frameborder="0"></iframe>&nbsp;&nbsp;&nbsp;
-   <iframe src="https://www.desmos.com/calculator/1aqrfjge22?embed" width="48%" height="300px" frameborder="0"></iframe>
+   <iframe src="https://www.desmos.com/calculator/fjjjpsnl2g?embed" width="47%" height="300px" frameborder="0" style="margin-right: 4%"></iframe>
+   <iframe src="https://www.desmos.com/calculator/1aqrfjge22?embed" width="47%" height="300px" frameborder="0"></iframe>
 
 A speedy iteration
 -------------------
@@ -54,5 +72,5 @@ A speedy iteration
 
 .. raw:: html
 
-   <iframe src="https://www.desmos.com/calculator/4xsjfwa5vh?embed" width="48%" height="300px" frameborder="0"></iframe>&nbsp;&nbsp;&nbsp;
-   <iframe src="https://www.desmos.com/calculator/9yjpijk1fv?embed" width="48%" height="300px" frameborder="0"></iframe>
+   <iframe src="https://www.desmos.com/calculator/4xsjfwa5vh?embed" width="47%" height="300px" frameborder="0" style="margin-right: 4%"></iframe>
+   <iframe src="https://www.desmos.com/calculator/9yjpijk1fv?embed" width="47%" height="300px" frameborder="0"></iframe>

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ The docs currently contain some original research contributions not published an
    :maxdepth: 2
    :caption: Algorithms:
 
-   algorithms/newton-schulz
+   .. algorithms/newton-schulz
    algorithms/manifold/index
 
 .. toctree::
