In this story, we will explore a clear and insightful grasp of three related concepts in mathematical optimization. These concepts required substantial time and effort for me to fully grasp, so I’ve aimed to present them in an intuitive way for all readers. Our journey will commence with a refresher on unconstrained optimization, followed by a consideration for constrained optimization, where we’ll utilize Lagrange Multipliers and KKT conditions. We also delve into the interplay between these ideas and their connections to the concept of duality.

## Unconstrained Optimization

In unconstrained optimization, we are given a multivariable function *f(u)* and we want to find the value for the vector *u* *where the value of the function *f(u*) *is optimum (maximum or minimum).

A function in general can have multiple maxima and minima as shown above. In classical machine learning and throughout this story, we will be mostly interested in convex functions (that are also sufficiently smooth). Being convex implies that the function has at most one optimum value (which is one minimum when the function at hand is a loss function) as shown below:

It’s much easier to deal with convex functions since otherwise it can be really hard to tell whether the minimum found is the lowest of all (i.e., the global minimum) and not just some local minimum. In general, even when there is one minimum value there can be many points that satisfy it (e.g., if it’s flat) we will pretend that this case doesn’t happen to simplify the explanation; assuming that it happens won’t change anything we derive whatsoever.

Performing unconstrained optimization on a given multivariable function *f(u) *is possible by solving *∇ᵤf(u) = 0. *If *f(u) *is a function in *n *variables *(u***₁***, u*₂*,…,u*ₙ*) *then this is a system of *n *equations:

Which once solved returns the optimal solution *u*=(u****₁***,u**₂*,…,u**ₙ*) *which is where the optimal value (e.g., minimum) occurs.

To see where this comes from, recall that:

- The normal vector to the tangent plane at any point
*u*takes the form*(∂f(u)/∂u₁, ∂f(u)/∂u*₂, …,*∂f(u)/∂u*ₙ,*f(u))* - The tangent plane at any minimum or maximum is horizontal (visually obvious)
- Hence, whenever
*∇ᵤf(u) = 0*holds, there is a horizontal tangent plane at that point and thus, must be the minimum we are looking for.

Another way to rationalize this which will be useful shortly, is to observe that the gradient points towards the direction of greatest increase (AND opposite to the direction of greatest decrease). Thus, when *∇ᵤf(u) = 0 *holds, it must either be not possible to (there is no direction that would) increase the function from that point (i.e., at maximum) or decrease the function from that point (i.e., at a minimum).

**Unconstrained Optimization Summary**

**Given:** *f(u)***Wanted:** *u where f(u) is minimum***Approach:** Solve *∇ᵤf(u)* = 0 since that holds at the minimum

In this type of optimization, we are given an equality constraint of the form *g(u)=0 *or g(u)≤0 (else we can put it in this form by rearranging terms or multiplying by a negative) and we want to optimize **over only all the points satisfying the constraint**.

We typically assume that equality constraints *g(u)=0* are affine (generalization of linear) and that inequality constraints g(u)≤0 involve convex functions so that the whole optimization problem is convex (otherwise, the fact that *f(u)* is convex alone may not be sufficient to guarantee one optimal value).

**Constraint Optimization with Equality Constraints**

In this, we are given a multivariable function* f(u) *and a constraint *g(u)=0 *and we want to find the point *u* *where *g(u*)=0 *and* f(u*) *is minimum (i.e., lowest possible point while satisfying the constraint).

For instance, in the example shown the objective function is *f(u₁,u*₂*) = u₁+ u*₂ (3D plane) and the constraint is *u*²*₁+ u*²₂=1 (2D circle). The goal is to find the point (*u₁, u*₂) corresponding to the lowest point on the plane* *where *u*²*₁+ u*²₂=1 holds *(*i.e., the (*u₁, u*₂) where the lowest point on the circle projected onto the plane occurs).

One approach to solve this type of constrained optimization problems is to use the method of Lagrange multipliers. In simple terms, the Lagrange multiplier theorem states that any solution *u* *to an optimization problem of the form:

**Minimize** *f(u) ***such that** *g(u)=0*

must satisfy the equation*∇ᵤL(u*,λ)=0 *for* *some* λ*∈R* (*and trivially, *g(u*)=0) *where *L *is the Lagrangian* *function which is given by *L(u,λ)=f(u)+λg(u). *It assumes that *∇ᵤg(u*)≠0.*

It follows from this that we can solve constrained optimization problems with equality constraints as follows (assuming *∇ᵤg(u*)≠0):*

- Write the Lagrangian function
*L(u,λ)=f(u)+λg(u).* - Solve
*n+1*equations resulting from*∇ᵤL(u,λ)*= 0 (*n*equations) with*g(u)=0*to find the*n+1*unknowns*u****₁***,u**₂*,…,u**ₙ,*λ* - The solution is
*u*=(u****₁***,u**₂*,…,u**ₙ*)*

*λ *is called a Lagrange multiplier. We only need to find it because it’s part of the system that yields the solution *u*.*

You can work out the example corresponding to the figure above here. In this example, the problem is not convex and solving should yield any minimum or maximum present. Notice that steps (1) and (2) above are equivalent to performing unconstrained optimization on *L(u,λ)=f(u)+λg(u). *That is, setting:

*∇ L(u,λ)* =*(∂L(u)/∂u₁, ∂L(u)/∂u*₂, …, *∂L(u)/∂u*ₙ,* ∂L(u)/∂λ)=* 0

In this sense, this method of Lagrange multipliers is powerful in that it casts a constrained optimization problem into an unconstrained optimization problem which we can solve by simply setting the gradient as zero.

**Rationale**

It’s not hard to derive with intuition why this works. The **feasible region** is the set of points that satisfy the problem’s constraints (e.g., those on the circle above); we want to find, among such points, the point where the objective function is optimum.

We know that *∇ f(u) *points opposite to the direction of the greatest decrease (along the direction of the greatest increase). However, in our case, we are only allowed to move in the feasible region (points satisfying the constraint); thus, to minimize the function under the constraint, we should want to move in the direction of greatest decrease *along* the constraint curve (so we never exit the feasible region and reach the min).

Suppose the direction of the tangent at point *u *on the constraint curve is given by *r(u)*, then recalling the formula for vector projection, we want to move opposite to the following direction (*∇ f(u) *projection on* r(u))*:

Similar to the unconstrained case above, it should dawn on you that whenever this is 0, we can’t move in any direction along the constraint to further increase *f(u) *(if at a maximum) or decrease it (if at a minimum).

It’s obvious that for this to be zero, we need *r(u)≠0* (so the denominator isn’t zero) and *∇ f(u) ⋅ r(u)=0. *For the latter,* *We know that the normal vector on the constraint curve*∇ g(u) *is perpendicular to the tangent *r(u). *Hence, all we need is*∇ f(u) *to be parallel to *∇ g(u).*

Thereby, it must hold at the optimal point *u* *that:

- The normal to the constraint is nonzero:
*∇ g(u*)≠0*(so that*r(u*)≠0)* - The constraint is satisfied:
*g(u*)=0*(trivial requirement) *∇ f(u*)*∥*∇ g(u*)*: there exists some real β where*∇ f(u*) =*β*∇ g(u*)*

Notice that by rearranging terms and renaming -β, (3) is equivalent to “there exists some real *λ *where *∇ f(u)+λ∇ g(u)=0”. *In other words,

*∇ᵤL(u,λ)*= 0, and by that, we have intuitively derived the Lagrange multipliers theorem for one constraint (scroll up if needed).

Notice that the first condition is called a constraint qualification. If it isn’t satisfied by the constraint at a point where (2) and (3) are satisfied, then there are no guarantees that this point is optimal as the projection is not defined there.

**Multiple Equality Constraints**

When multiple constraints g*₁(u), g*₂*(u),…,g*ₖ*(u) *are present, the method smoothly generalizes to the following:

- Write the Lagrangian
*L(u,λ₁,λ*₂*,…,λ*ₖ*) = f(u) + λ₁*g*₁(u) + λ*₂g₂*(u) +…+λ*ₖgₖ*(u)* - Solve
*n+k*equations by setting*∇ᵤL(u,λ₁,λ*₂*,…,λ*ₖ*)*= 0 (*n*equations) with*g₁(u)=0, g*₂*(u)=0, …, g*ₖ*(u)=0*to find*n+k*unknowns*u****₁***,u**₂*,…,u**ₙ,*λ₁,λ*₂*,…,λ*ₖ - The solution is
*u*=(u****₁***,u**₂*,…,u**ₙ*)*

The assumption*∇ g(u*)*≠*0 *generalizes to that *∇ *g*₁(u), ∇ g*₂*(u),…,∇ g*ₖ*(u) *must be linearly independent. This is called LICQ (linear independence constraint qualification).

## Constraint Optimization with Inequality Constraints

Matters don’t get much more complex when we are rather dealing with an inequality constraint of the form *g(u)≤0*. In this case, we want the optimum point of *f(u) *that satisfies *g(u)≤0.*

For the problem above, this means that the feasible region is not just the points on the circle, it’s also the points inside it. It should be obvious that for that particular problem (and not generally), this does not change the solution.

Instead of solving the two conditions of Lagrange multipliers (2, 3) we solve a set of four conditions called KKT conditions that generalize the Lagrange multipliers case. We can derive them as follows:

Observe that with an arbitrary hypersurface *f(u) *and constraint *g(u)≤0, *there are exactly two possibilities assuming a convex smooth function with one optimum:

- The optimum point
*uᴾ*lies inside the feasible region.

- In this case, the solution to the optimization problem
*u**must be*uᴾ*and*g(u*)<0*must hold (left image). - It’s impossible to find a more optimum point in the feasible region because
*uᴾ*is the most optimal point (e.g., minimum) over the entire region (domain) of*f(u).*

2. The optimum point *uᴾ *lies outside the feasible region.

- In this case,
*f(u)*must be only decreasing in the feasible region if that point is a maximum (can’t increase again else creates another optimal point) *f(u)*must be only increasing in the feasible region if that point is a minimum (can’t decrease again else creates another optimal point)- Thus, the optimum point
*u**must be at the edge of the feasible region as it never gets better inside (*g(u*) = 0*must hold)

In the first case, it’s obvious that solving the optimization problem is equivalent to solving the unconstrained version of it.

*∇ᵤf(u) = 0*

We say that the constraint is “inactive because” it doesn’t make a difference in the optimization problem.

In the second case, it’s obvious that solving the optimization problem is equivalent to solving the equality constraint version of it (Lagrange multipliers).

The only catch for that case is that *λ *must be ≥ 0 for minimization and must be ≤ 0 for maximization. For minimization, this implies that *∇ᵤf(u) *and *∇ᵤg(u) *point in opposite directions (i.e., β in*∇ᵤ f(u) = *β*∇ ᵤg(u) *is ≤0) which has to hold because *∇ᵤg(u) *points towards the positive side of the constraint *g(u)≥0 *(basic property)*; *meanwhile, *∇ᵤ f(u) *points to the negative side of the constraint because that’s where *f(u) *is increasing. A similar argument can be easily constructed for the case of maximization.

But we don’t know beforehand which of these two cases applies. We can merge their methods as follows (assuming minimization):

- Write the Lagrangian function
*L(u,λ)=f(u)+λg(u)* - Set
*∇ᵤL(u,λ)*= 0 (*n*equations) and*g(u)≤0* - Solve for the solution (
*u****₁***,u**₂*,…,u**ₙ,*λ.*) where one of the cases above applies:

*λ=0*and*g(u*)<0 (*first case as*λ=0*means that*∇ᵤL(u,λ)*=*∇ᵤf(u) = 0*so steps 1,2 are equivalent to solving*∇ᵤf(u) = 0*)*g(u*)=0*and*λ≥0 (*second case as*g(u)=0*means that applying Lagrange is correct and that’s what we did)

We can summarize these two bullets in that *g(u*)≤0 *and *λ≥0* must hold and that *λg(u*)=0 *must hold (one of *λ *or *g(u*) *must be zero). This implies that given an optimization problem of the form:

**Minimize** *f(u) ***such that** *g(u)≤0*

We expect the following four conditions to be satisfied for the optimal point *u*:*

- Stationarity:
*∇ᵤL(u*,λ)*= 0 - Primal Feasibility:
*g(u*)≤0* - Dual Feasibility:
*λ≥0* - Complementary Slackness:
*λg(u*)=0*

and that solving these conditions together yields the optimal point u*. In reality for convex problems, these conditions are sufficient but not necessary for optimality. That is,

- If a point satisfies these conditions (e.g., found by solving them together) then that suffices to prove that the point is optimal (no need to look further for convex problems).
- Meanwhile, these conditions are not necessary for a point to be optimal. It is possible that solving the conditions gives no solution when in reality there is an optimal point that doesn’t satisfy them. For example, consider
*f(x) = x*and the constraint*x² ≤ 0*(both this and another KKT example are solved in this document)

Once we enforce a constraint qualification such as LICQ (stated earlier), we can guarantee that KKT conditions are both sufficient and necessary. An alternative constraint qualification that is easier to check is Slater’s condition which guarantees that KKT is necessary and sufficient for convex problems.

Slater’s condition simply states that the feasible region must have an interior point. That is, for a constraint *g(u)≤0 *the function must have point(s) in the domain of *f(u)* satisfying *g(u)<0. *This is a basic condition that is almost always satisfied in real-life problems (but not the counterexample above), and this means that KKT will rarely ever miss finding an optimal solution.

**Multiple Constraints**

When multiple equality constraints h*₁(u), h*₂*(u),…,h*ₖ*(u) *are present along with multiple inequality constraints g*₁(u), g*₂*(u),…,g*ₚ*(u)*, the method smoothly generalizes by writing the full Lagrangian and checking the KKT conditions only for inequality constraints and their multipliers (which we will call α):

0. Write the Lagrangian

- Set
*∇ᵤL(u,λ₁,λ*₂*,…,λ*ₖ, α*₁*, α₂, …, αₚ*)*= 0 (n equations)

2. Set h*₁(u)=0, h*₂*(u)=0, …, h*ₖ*(u)=0 (k *equations) and set

g*₁(u)≤0, g*₂*(u)≤0, …, g*ₚ*(u)≤0 *(*p* inequalities)

3. Set α*₁≥0*, α₂≥0, …, αₚ≥0 (*p* inequalities)

4. Set α*₁*g*₁(u) = *α₂*g*₂*(u) = *αₚ*g*ₖ*(u) = 0 *(*p* equations)

In total, you have *n+k+p* equations and *2p* inequalities that you will solve together to find *n+k+p *variables *(u****₁***,u**₂*,…,u**ₙ,*λ₁,λ*₂*,…,λ*ₖ,α*₁*, α₂, …, αₚ *) *which would yield the solution *u*=(u****₁***,u**₂*,…,u**ₙ*) *that minimizes the function while satisfying the* k+p* constraints.

## The Duality Principle

The duality principle simply states that for any optimization problem, we can write a dual optimization problem that when solved either tells us something about the original problem (called primal) or solves it.

For any optimization problem of the form:

The dual optimization problem takes the form:

and vice versa if it is maximization.

**Example**

For instance, the constrained optimization problem that we discussed earlier:

Has the corresponding dual problem:

As basic calculus suggests, to carry on the minimization first we do

which implies

and thereby, the optimization problem becomes

and all it takes now is to differentiate this and equate to zero to get λ = 1/√2 which implies that *(x*, y*)* = (−1/√2, −1/√2) and which is the same solution that we had by solving the primal problem via KKT.

**Deriving the Dual**

The primal (original) problem is

Suppose we define a function that returns infinity whenever *u *is not in the feasible region (doesn’t satisfy the constraint), and zero otherwise:

In this case, the primal problem is equivalent to

This should make sense because *f(u)+P(u) *sets anything outside of the feasible region to infinity and leaves the feasible region as is. The minimum of this sum must occur in the feasible region even though the constraint is enforced explicitly because infinity is greater than anything.

Observe that we can claim that:

Because with this, if:

*g(u)<0*then*P(u)=0*as defined earlier since*λ=0*must hold to maximize the quantity*λg(u)*(else it’s negative due to*g(u)<0*)*g(u)=0*then*P(u)=0*as defined earlier as*λg(u)*will be zero (*λ*can be anything)*g(u)>0*then*P(u)=*∞ as defined earlier as*λ=*∞ is what would maximize*λg(u)*

Thereby, the primal problem is equivalent to

The difference between this and the dial problem is that in the dual the Max and Min are swapped.

Thus, because in general *MinMax(…) ≥ MaxMin(…) *a solution to the dual would be a lower bound to the solution of the primal. This is called weak duality.

A more interesting case is when *MinMax(…) = MaxMin(…) *where a solution to the dual would be exactly the solution to the primal as well (as in the example). This is called strong duality. You can moderatily easily prove that the equality holds (and hence strong duality) when KKT is both necessary and sufficient. In other words, strong duality will hold for convex problems whenever Slater’s condition holds!

**So What?**

If you think about it, solving the dual problem is akin to only applying the stationarity and dual feasibility conditions of KKT on the primal problem. Instead of applying primal feasibility and complementary slackness you get to deal with an extra minimization over dual variables. In many cases, this is much easier than solving KKT on the primal problem. The extra minimization can be for instance tackled with linear or quadratic programming.

**Multiple Constraints?**

In generalizing to multiple constraints, the Lagrangian changes just as you would expect (similar to what we have seen) and we only add α≥0 conditions in maximization to multipliers associated with inequality constraints.

Hope this story has helped you truly understand unconstrained optimization, Lagrange Mulipliers, KKT and duality. Till next time, au revoir!