
Your function's doppelgänger


One of the first things I realized when approaching convex analysis is the importance of notation. Using simple symbols, or tiny adjustments to existing marks, to encapsulate different topics, sometimes even a whole class of problems, is crucial for developing extremely powerful mathematical primitives. The Fenchel conjugate, introduced by Werner Fenchel in 1949, perfectly illustrates this idea. Although it may seem mysterious at first glance, it is one of the most important primitives in duality theory. In this post, as in others I may write in the future, I’m not aiming at a fully rigorous mathematical treatment of the concept. Instead, I want to offer inputs that spark the kind of mental machinery that builds practical intuition, making the idea easier to categorize and recall.

Definition

Given any function \(f:\mathbb{R}^n\to\mathbb{R}\), the Fenchel conjugate of \(f\) is defined as

\[f^\star(y)=\sup_x\; \langle x,y\rangle - f(x)\]

You can refer to the legendary Boyd’s book (chapter 3.3) or the Stanford class recordings for a nicer and more exhaustive introduction to this concept. At first glance, the Fenchel conjugate resembles the optimal value of a tiny optimization problem in \(x\), where that optimal value is kept as a function of the parameter \(y\). Thus, speaking of notation, the apparently harmless symbol \(f^\star\) is quietly hiding the solution to a certain optimization problem involving its spiritual relative \(f\). This is a good path for developing a first intuition about this operator, so it is worth elaborating with an example.
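Before diving into the example, here is a minimal numerical sketch of the definition: approximating \(f^\star(y)\) by brute-force search over a grid of \(x\) values. The grid, the toy \(f\), and the function name are illustrative choices of mine, not a standard recipe for computing conjugates.

```python
# A minimal sketch: approximating the Fenchel conjugate of a 1-D function
# by brute-force search over a grid of x values (illustration only).
import numpy as np

def fenchel_conjugate(f, y, x_grid):
    """Approximate f*(y) = sup_x <x, y> - f(x) over a finite grid of x values."""
    values = x_grid * y - f(x_grid)      # <x, y> - f(x) for every grid point
    return values.max()                  # the (approximate) supremum

x_grid = np.linspace(-10.0, 10.0, 100_001)
f = lambda x: x**2                        # the toy function used later in the post

print(fenchel_conjugate(f, 3.0, x_grid))  # ~ 3**2 / 4 = 2.25
```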

Example for hypebeasts

Imagine you work at Nike and want to determine the optimal quantity of the upcoming hyped sneakers to release on SNKRS that are going to be sniped by hypebeasts. Given that your marketing manager has already set fixed prices

\[y=(y_1,\dots,y_n)\in\mathbb{R}^n\]

for each of the \(n\) upcoming collections, you are asking:

“What are the optimal amounts of sneakers \(x=(x_1,\dots,x_n)\in\mathbb{R}^n\) that should be produced and released for this drop? Should we drop more Air Jordan 3 or Nike Dunk Low?”.

Suppose that the production and distribution cost is captured by a certain function \(f:\mathbb{R}^n\to\mathbb{R}\) (as fancy as you want) which aggregates the total cost of producing each quantity \(x_i\) for every collection. In other words, \(f(x)\) represents the total cost for this drop if you decide to produce the vector of amounts \(x\). On the other hand, since the \(i\)-th collection is going to be sold at a price of \(y_i\) dollars and assuming the hype for the upcoming drop is so high that there will likely be no leftovers, the total revenue for this drop is going to be

\[\langle x, y\rangle=\sum_{i=1}^n y_ix_i\]

Having said that, it would be wise to produce the amounts of sneakers that maximize the profit for this drop, where the profit function is simply total revenue minus total cost, that is:

\[\Pi(x;y)=\langle x, y\rangle-f(x)\]

As a result, armed with optimization solvers, your goal is to solve the following optimization problem:

\[\begin{align*} \sup_{x}\; \Pi(x;y) &= \sup_{x}\; \langle x, y\rangle-f(x)\\ &=:f^\star(y) \end{align*}\]

which is precisely the definition of the Fenchel conjugate! In other words, by knowing the Fenchel conjugate \(f^\star\) of the production and distribution cost function \(f\), you also know an upper bound on the profit you can score from this drop for any prices recommended by the marketing manager, and this upper bound is attained if you produce the optimal amounts of sneakers

\[x^\star(y) = \arg\max_x\; \langle x, y\rangle-f(x)\]

Given prices \(y\), if \(f^\star(y)\) is particularly low then you can’t be particularly ambitious about the gains of this drop, and you have a provable argument to blame the marketing manager for the chosen prices.
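To make this concrete, here is a minimal sketch of the drop problem with an invented quadratic cost. The matrix \(Q\), the prices, and the two-collection setup are purely illustrative assumptions (not real data), and the solver call is just one convenient way to compute the maximizer.

```python
# A hedged sketch of the sneaker-drop problem with an invented quadratic cost.
import numpy as np
from scipy.optimize import minimize

Q = np.diag([2.0, 5.0])                        # assumed production-cost curvature
f = lambda x: 0.5 * x @ Q @ x                  # total production/distribution cost f(x)
y = np.array([200.0, 120.0])                   # prices fixed by the marketing manager

profit = lambda x: x @ y - f(x)                # Pi(x; y) = <x, y> - f(x)

res = minimize(lambda x: -profit(x), x0=np.zeros(2))
print("optimal quantities x*(y):", res.x)      # analytic answer: Q^{-1} y = [100, 24]
print("maximal profit   f*(y):  ", -res.fun)   # analytic answer: 0.5 * y^T Q^{-1} y = 11440
```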

Some properties

Convexity and lower semi-continuity

The Fenchel conjugate \(f^\star\) might have different properties depending on the related function \(f\). However, two properties that are always guaranteed (independently of the structure of \(f\)) are

  • Convexity
  • Lower semi-continuity (closed epigraph)

Both properties are particularly desirable in numerical optimization (for reasons I might discuss in future posts) and can be rigorously derived. To get a rough idea of why these two properties are always ensured, you can focus on the fact that by definition \(f^\star\) is the pointwise supremum of a collection of affine functions. Notice in fact that

\[\langle x, y\rangle-f(x)\]

is affine in \(y\) if you treat \(x\) as a simple parameter. So, for example, by picking different values of \(x\), like \(x_1\) or \(x_2\), you come up with two different affine functions. In other words, the parameter \(x_i\) you pick fully characterizes the function \(\langle x_i, y\rangle-f(x_i)\): this means that if you have a collection of points

\[\{x_1, x_2,\dots\}\]

you also have a collection of functions

\[\mathcal{C_f}=\{\langle x_1, y\rangle-f(x_1),\langle x_2, y\rangle-f(x_2), \dots\}\]

where each function is indexed by the respective \(x_i\). Of course, since \(x_i\in\mathbb{R}^n\) (i.e. the collection of indices is \(\mathbb{R}^n\)) you have infinitely many indices, and so the collection \(\mathcal{C_f}\) also contains infinitely many elements.

Now, a very heuristic description of what the Fenchel conjugate does is the following:

Given a certain \(y\) as input, can you return the element with the highest value by searching the collection \(\mathcal{C_f}\)?

Suppose that \(\langle x_k, y\rangle-f(x_k)\) is the highest value; then \(x_k\) is the index associated with that highest value.

I hope this very basic and informal description helps you interpret

\[f^\star(y) =\sup_{x}\; \langle x, y\rangle-f(x)\]

as the “highest value” in the collection \(\mathcal{C_f}\) given a certain \(y\) and

\[\arg\max_{x}\; \langle x, y\rangle-f(x)\]

as the “index point” where such value is attained.

As mentioned, understanding the Fenchel conjugate as the pointwise supremum of a collection of affine functions is also extremely important for the rigorous proof of its convexity and lower semi-continuity (omitted here), but we can still use this argument to develop a graphical intuition. Very naively, consider

\[f(x)=x^2\]

then, consider

\[x^\star(y)= \arg\max_x \; \langle x, y\rangle-f(x)\]

that is, following the discussion above, the point where the highest value of the function \(\langle x, y\rangle-f(x)\) is attained for a given \(y\). Notice in fact that

\[\begin{align*} f^\star(y) &=\sup_x\; \langle x, y\rangle-f(x) \\ &= \langle x^\star(y), y\rangle-f(x^\star(y)) \end{align*}\]

So, in order to compute \(f^\star\), it is enough to find \(x^\star\). Adapted to our trivial problem, we have

\[x^\star(y)= \arg\max_x \; xy-x^2\]

and

\[\begin{align*} f^\star(y) &=\sup_x\; xy-x^2\\ & = x^\star(y)y-(x^\star(y))^2 \end{align*}\]

In order to find \(x^\star(y)\) we can trivially apply the first-order condition on \(xy-x^2\) to see that

\[y-2(x^\star(y))=0\iff x^\star(y)=\frac{y}{2}\]

As a result

\[f^\star(y)=\frac{y}{2}y-\left(\frac{y}{2}\right)^2=\frac{y^2}{4}\]

We can now plot both the graph of \(f\) (in red) and that of \(f^\star\) (in green) to practically taste the nature of the Fenchel conjugate as the supremum of a collection of affine functions; a minimal plotting sketch follows.
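Here is a rough sketch of that picture, assuming matplotlib is available: \(f(x)=x^2\) in red, \(f^\star(y)=y^2/4\) in green, and a few members of the affine family \(\langle x_i, y\rangle-f(x_i)\) whose pointwise supremum builds \(f^\star\). The plotting choices are mine and purely illustrative.

```python
# Sketch: f(x) = x^2, its conjugate f*(y) = y^2/4, and a few affine minorants.
import numpy as np
import matplotlib.pyplot as plt

y = np.linspace(-4, 4, 400)
f = lambda x: x**2

plt.plot(y, f(y), color="red", label="f(x) = x^2")
plt.plot(y, y**2 / 4, color="green", linewidth=2, label="f*(y) = y^2/4")

# each fixed x_i gives an affine function of y lying below f* (touching it at y = 2 x_i)
for x_i in np.linspace(-2, 2, 9):
    plt.plot(y, x_i * y - f(x_i), color="gray", alpha=0.4)

plt.legend()
plt.show()
```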

Notice in fact how the graph of each \(\langle x_i, y\rangle-f(x_i)\) defines a supporting hyperplane of the epigraph of \(f^\star\): it touches it at the point \((y, f^\star(y))\) for those \(y\) at which \(x_i\) attains the supremum. In particular, the epigraph of \(f^\star\) is fully contained in the epigraph of each \(\langle x_i, y\rangle-f(x_i)\), which is a halfspace. Most importantly, the intersection of all such (infinitely many) halfspaces precisely recovers the epigraph of \(f^\star\), demonstrating that it is a closed convex set. This construction is the typical dual representation of closed convex sets.

Why is this important? Because the convexity and lower-semicontinuity of \(f^\star\) are naturally inherited from the convexity and closedness of its epigraph.

Fenchel-Young inequality

Another very important property, which is actually the very first characterization of conjugate functions in Fenchel’s seminal work, is the Fenchel-Young inequality, which bounds \(f\) with its doppelgänger as follows

\[f(x)+f^\star(y)\geq\langle x, y\rangle \quad \forall \;x,y\]

This property follows naturally from the definition of the conjugate; indeed, notice that

\[\begin{align*} f^\star(y)&:=\sup_{x}\; \langle x, y\rangle-f(x) \\ &\geq \langle x, y\rangle-f(x) \quad \forall \;x,y \end{align*}\]

Indeed, for any given \(y\) the LHS of the inequality “tunes” \(x\) to attain the supremum, and because of this tuning its value can’t be “worse” (in the sense of lower) than the evaluation at a generic, possibly non-optimal, point \(x\).
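For a quick sanity check, we can test the inequality numerically on the pair computed above, \(f(x)=x^2\) and \(f^\star(y)=y^2/4\). This is a randomized check, purely illustrative.

```python
# Randomized check of Fenchel-Young for f(x) = x^2, f*(y) = y^2/4.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=10_000), rng.normal(size=10_000)

lhs = x**2 + y**2 / 4          # f(x) + f*(y)
rhs = x * y                    # <x, y>
assert np.all(lhs >= rhs)      # Fenchel-Young: f(x) + f*(y) >= <x, y> for all x, y
```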

Support functions

As just experienced, mapping set properties to function properties is the secret sauce of convex analysis: it gives you a very deep understanding of convex functions and how they relate to each other by simply inspecting the properties of the associated sets. This is particularly true for support functions.

Relation with supporting halfspaces

Roughly speaking, given a closed convex set \(C\subseteq\mathbb{R}^n\), the support function \(\delta^\star(y\|C)\) maps a generic normal vector to the “optimal offset” such that the set

\[H(y)=\{x: \langle x,y\rangle\leq\delta^\star(y\|C)\}\]

denotes a supporting halfspace for the set \(C\). By definition of supporting halfspace, this means that given any \(y\) one will always have

\[\langle x,y\rangle\leq\delta^\star(y\|C) \quad \forall\; x\in C\]

with \(\langle x,y\rangle=\delta^\star(y\|C)\) only at a certain point (or set of points), where the hyperplane associated with \(H(y)\) touches \(C\).

In other words, since no point in \(C\) can make \(\langle x,y\rangle\) greater than \(\delta^\star(y\|C)\), and the best you can have is \(\langle x,y\rangle=\delta^\star(y\|C)\) for at least one point of \(C\), you can safely say that

\[\delta^\star(y\|C)=\sup_{x\in C} \langle x,y\rangle\]

Relation with indicator function

So, \(\delta^\star(y\|C)\) is the result of a maximization of a linear function over a convex set (if the set \(C\) is polyhedral, then it is actually a linear program).
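As just noted, when \(C\) is polyhedral the support function is the optimal value of a linear program. A minimal sketch, assuming \(C\) is the probability simplex \(\{x\geq 0,\ \sum_i x_i = 1\}\) (an arbitrary illustrative choice), in which case the support function is simply \(\max_i y_i\):

```python
# Support function of the probability simplex, evaluated as a linear program.
import numpy as np
from scipy.optimize import linprog

def support_simplex(y):
    """Evaluate sup_{x in simplex} <x, y> with an LP solver."""
    n = len(y)
    # linprog minimizes, so maximize <x, y> by minimizing -<x, y>
    res = linprog(c=-y,
                  A_eq=np.ones((1, n)), b_eq=np.array([1.0]),  # sum_i x_i = 1
                  bounds=[(0, None)] * n)                      # x_i >= 0
    return -res.fun

y = np.array([3.0, -2.0, 0.5])
print(support_simplex(y), np.max(y))   # both 3.0: the support function is max_i y_i
```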

A typical approach to solving this problem is to introduce Lagrange multipliers that gently penalize the objective if the constraints are not satisfied, but we can do something more “violent” by introducing a function that is a ghost when the constraints are met and that attaches an infinite penalty when they are not. This leads to the characteristic (or indicator) function (notice that it is different from the indicator function used in probability), which is

\[\delta(x\|C) = \begin{cases} 0 & \text{if } x \in C, \\ \infty & \text{otherwise}. \end{cases}\]

Notice that the effective domain of this function (i.e. the set of points where the function takes finite values) corresponds to the domain of the optimization problem (i.e. the set of feasible points), which is the set \(C\)

\[\text{dom}\;\delta(\cdot\|C) = C\]

Through the indicator function we can rephrase the problem in the “violent” form, that is

\[\begin{align*} \delta^\star(y\|C)&=\sup_{x\in C} \langle x,y\rangle \\ &=\sup_{x} \langle x,y\rangle - \delta(x\|C) \end{align*}\]
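A tiny numerical illustration of this rephrasing, on a 1-D grid with \(C=[-1,2]\) (an arbitrary choice of mine): the constrained supremum over \(C\) and the indicator-penalized supremum over the whole line return the same value.

```python
# Constrained sup over C vs. "violent" indicator-penalized sup over the whole line.
import numpy as np

x = np.linspace(-5, 5, 10_001)          # grid over the whole line
in_C = (x >= -1) & (x <= 2)             # C = [-1, 2]
delta = np.where(in_C, 0.0, np.inf)     # indicator function delta(x | C)

y = 1.5
constrained = (x[in_C] * y).max()       # sup over x in C of <x, y>
penalized = (x * y - delta).max()       # sup over all x of <x, y> - delta(x | C)
print(constrained, penalized)           # both 3.0
```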

This rephrasing reveals that the support function of \(C\) is precisely the conjugate of the indicator function of \(C\)! If you were keeping an eye on the notation, congrats, you basically gave yourself a little spoiler already. It is worth giving a second look also at the supporting half-space definition:

\[\langle x,y\rangle\leq\delta^\star(y\|C) \quad \forall \;(x,y)\in C\times \mathbb{R}^n\]

which can be safely rewritten as \(\langle x,y\rangle\leq\delta^\star(y\|C) + \delta(x\|C) \quad \forall \;x,y\). This also allows us to re-interpret the supporting half-space definition as the usual Fenchel-Young inequality!

Dual representation of closed convex sets

We can also take a look at the set definition of the “supported” set \(C\). Indeed, whatever the actual shape of \(C\) is, we surely know that every vector \(y\) acts as the normal vector of a halfspace \(H(y)\) such that \(C\) is fully contained in \(H(y)\). Since by assumption \(C\) is also closed, this suggests that \(C\) is definable as the intersection of all the half-spaces supporting it, which, as shown in the previous example, is the typical dual representation of a closed convex set. This is quite powerful indeed: we can have a set definition for \(C\) while being totally agnostic about its actual shape!

\[\begin{align*} C&=\bigcap_y H(y)\\ &= \{x: \langle x,y\rangle\leq\delta^\star(y\|C)\;\;\forall\; y\in\mathbb{R}^n\} \end{align*}\]

We can simply rewrite it as

\[\begin{align*} C&=\left\{x: \sup_{y\in\mathbb{R}^n}\langle x,y\rangle-\delta^\star(y\|C)\leq 0\right\}\\ &=\{x: \delta^{\star\star}(x\|C)\leq 0\} \end{align*}\]

This involves the conjugate of a conjugate! In general such a function is the closed convex version of the original function \(f\); however, since the indicator function of a closed convex set is already closed and convex, we have \(\delta^{\star\star}(x\|C)=\delta(x\|C)\). As expected from the definition of the indicator function, \(C\) is indeed its effective domain

\[C=\{x: \delta(x\|C)\leq 0\}=\text{dom}\; \delta(\cdot\|C)\]
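A rough sketch of this dual representation used as a membership test: a point belongs to \(C\) exactly when \(\langle x,y\rangle\leq\delta^\star(y\|C)\) for every \(y\). Here \(C\) is the Euclidean unit ball (whose support function is \(\|y\|_2\)), and the “for every \(y\)” is approximated by a finite random sample of directions, so this is only an illustration.

```python
# Membership test via sampled supporting halfspaces of the Euclidean unit ball.
import numpy as np

rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 2))                 # sampled normal vectors y

def in_unit_ball(x, tol=1e-9):
    support = np.linalg.norm(directions, axis=1)        # delta*(y | B_2) = ||y||_2
    return np.all(directions @ x <= support + tol)      # <x, y> <= delta*(y | C) for all sampled y

print(in_unit_ball(np.array([0.5, 0.5])))   # True,  inside the ball
print(in_unit_ball(np.array([1.2, 0.0])))   # False, outside the ball
```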

Example: norms

The typical examples used to introduce support functions are norms. In particular, the Euclidean norm \(\|\cdot\|_2\) is the support function of its own unit ball, while the L-1 norm is the support function of the L-infinity unit ball and vice versa (hinting at another fancy relationship, named “polarity”, between the two balls, which we are not discussing here; you can refer to Convex Analysis by Rockafellar, chapter 14). A quick numerical check of this duality is sketched below.
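The sketch checks one direction of the statement: the support function of the L-infinity unit ball (a box) agrees with the L-1 norm. Since a linear function over a box attains its maximum at a vertex, it is enough to enumerate the \(2^n\) sign vectors; the specific \(y\) is an arbitrary choice of mine.

```python
# The support function of the l-infinity unit ball equals the l1 norm.
import itertools
import numpy as np

y = np.array([2.0, -3.0, 0.5])

vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))  # vertices of the box
print(np.max(vertices @ y))        # sup over the l-infinity ball: 5.5
print(np.sum(np.abs(y)))           # ||y||_1                     : 5.5
```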

A subdifferential quick look

Without digging too much into subdifferentiability, we simply introduce the subdifferential of a convex function as a multivalued mapping that assigns to each point \(x_0\) in the domain of \(f\) a set of vectors called subgradients. These subgradients are the vectors that satisfy the subgradient inequality:

\[y\in \partial f(x_0) \iff f(z)\geq f(x_0)+\langle y,z-x_0\rangle\; \forall\; z\]

If the RHS looks familiar, it’s because in the differentiable case it resembles the first-order Taylor expansion of the function at \(x_0\): in particular, the subgradient inequality also tells you that any such first-order approximation of the function \(f\) always underestimates it (unless you consider \(z=x_0\), in which case they are equal); indeed it is said to be a global underestimator of \(f\).

Subdifferential representation with Fenchel conjugates

Why is this interesting for this post? Because the Fenchel conjugate shows up here too. Indeed, with some rearranging of the subgradient inequality above (and taking the supremum over \(z\)), you can see that

\[\begin{align*} &\langle y,x_0\rangle\geq f(x_0) +\sup_z \;\langle y,z\rangle -f(z)\; \forall\; y\in \partial f(x_0) \\ &\langle y,x_0\rangle\geq f(x_0) + f^\star(y) \; \forall\; y\in \partial f(x_0) \end{align*}\]

Thus, the “rephrased” subgradient inequality seems to have the opposite sign compared with the Fenchel-Young inequality, and this might be confusing, since we expect the latter to hold for any pair of points. Indeed both inequalities are valid, and since neither of them is strict, the only way they can coexist is if the relationship holds with equality. Hence, rather than talking about a “subdifferential inequality”, when we use the conjugate notation we actually refer to a “subgradient equality”, even if that is not the proper name. Thus, the subdifferential can be rewritten as follows

\[\partial f(x_0) = \{y:f(x_0)+f^\star(y)=\langle y,x_0\rangle\}\]

Dual correspondence of subdifferentials

This relation is extremely useful because for closed convex functions (those satisfying \(f^{\star\star}=f\)) it provides the same set rule for both the subdifferential of \(f\) and that of \(f^\star\); indeed

\[\begin{align*} &\partial f(x_0) = \{y:f(x_0)+f^\star(y)=\langle y,x_0\rangle\} \\ &\partial f^\star(y_0) = \{x:f^\star(y_0)+f(x)=\langle y_0,x\rangle\} \end{align*}\]

As a result, for closed convex functions one has

\[y_0\in\partial f(x_0)\iff x_0\in \partial f^\star(y_0)\]

This also tells us that for closed convex functions \(\partial f\) and \(\partial f^\star\) are inverse maps of each other. Indeed, with some abuse of notation, you can plug one relation into the other to see that

\[\begin{align*} & y_0\in \partial f(\partial f^{\star}(y_0))\\ & x_0\in \partial f^{\star}(\partial f(x_0)) \end{align*}\]

Meaning that the composition of the two multivalued maps resembles the identity operator, and so \(\partial f\) and \(\partial f^\star\) are inverse maps of each other

\[\partial f \circ \partial f^\star = \partial f^\star \circ \partial f=I\]
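A sketch of this inverse-map relation on the toy example \(f(x)=x^2\), which is closed, convex and differentiable, so the subdifferentials reduce to single gradients: \(\nabla f(x)=2x\) and \(\nabla f^\star(y)=y/2\) (from \(f^\star(y)=y^2/4\) computed earlier).

```python
# The two (single-valued) subdifferential maps of f(x) = x^2 are inverses of each other.
import numpy as np

grad_f = lambda x: 2 * x          # subdifferential of f, here a single gradient
grad_fstar = lambda y: y / 2      # subdifferential of f*

x = np.linspace(-5, 5, 11)
# composing the two maps gives back the identity, in both orders
assert np.allclose(grad_fstar(grad_f(x)), x)
assert np.allclose(grad_f(grad_fstar(x)), x)
```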

This inverse mapping relationship also gives us hints about where the function \(\langle y,x\rangle -f(x)\), involved in the definition of the conjugate, attains its supremum. Indeed, by applying first-order conditions, one sees that the maximizing point satisfies this condition:

\[0\in y-\partial f(x) \iff x^\star(y) \in \partial f^\star(y)\]

Thus, if the function is closed and differentiable (i.e. when \(\partial f(x)=\{\nabla f(x)\}\)), one has that

\[\begin{align*} &f^\star(y)= \langle y,\nabla f^\star(y)\rangle - f(\nabla f^\star(y))\\ &f(x)= \langle x,\nabla f(x)\rangle - f^\star(\nabla f(x)) \end{align*}\]
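A quick numerical check of these two identities, again on the toy pair \(f(x)=x^2\), \(f^\star(y)=y^2/4\), with \(\nabla f(x)=2x\) and \(\nabla f^\star(y)=y/2\) (illustrative only).

```python
# Checking f*(y) = <y, grad f*(y)> - f(grad f*(y)) and the symmetric identity for f.
import numpy as np

f = lambda x: x**2
fstar = lambda y: y**2 / 4
grad_f = lambda x: 2 * x
grad_fstar = lambda y: y / 2

z = np.linspace(-5, 5, 101)
assert np.allclose(fstar(z), z * grad_fstar(z) - f(grad_fstar(z)))
assert np.allclose(f(z), z * grad_f(z) - fstar(grad_f(z)))
```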

We can give a graphical interpretation of this concept by keeping the same example above with \(f(x)=x^2\): the optimal point of \(yx-x^2\) is \(y/2\) (i.e. \(\nabla f^\star(y)\)) and the optimal value is \(\frac{y^2}{4}\) (i.e. \(f^\star(y)\), which for this particular \(f\) also equals \(f(\nabla f^\star(y))\)). In particular, the locus of all such maximal points \((y/2,\; y^2/4)\) traces back the original graph of \(f(x)=x^2\); quite psychedelic, isn’t it?

If you reached the end of this post you are a chad. See you next time!

