
More people should know about Lagrange multipliers

July 6, 2018

One of the most useful concepts I learned during my first year of graduate school was the method of Lagrange multipliers. This is something that can seem at first like an obscure or technical piece of esoterica – I had never even heard of Lagrange multipliers during my undergraduate physics major, and I would guess that most people with technical degrees similarly don’t encounter them.  When I was first taught Lagrange multipliers, my reaction was something like “okay, I’m guessing this is just a mathematical trick used by specialists in a few specific circumstances. After all, I’ve done just fine without it so far.”

But, like many mathematical tools, Lagrange multipliers are one of those things that open doors for you.  Once you understand how to do optimization using them, whole worlds of problems open up that you would have previously thought were too hard, or had no good solution. I personally have found myself using Lagrange multipliers for everything from statistics to quantum mechanics, from electron gases to basketball.

My goal for the next few posts is to derive some of the most important equations in physics: the “distribution functions” that relate energy to probability.  But before we get there it’s worth pausing to appreciate the power of Lagrange multipliers, which will be one of the major tools that enable us to understand how nature maximizes probability.


A simple example

The basic use of Lagrange multipliers is fairly simple: they are used to find the maximum or minimum of some function in situations where you have constraints.  For example, the standard introductory problem to Lagrange multipliers is usually something like this:

Suppose that you are living on an inclined plane described by the equation z = -2x + y, but you can only move along the circle described by x^2 + y^2 = 1.  What is the highest point (largest z) that you can reach? What is the lowest point?



What makes this problem tricky, of course, is the relationship between x and y.  If x and y were independent of each other, then you could simply maximize the function with respect to each variable independently.  But the constraint that x^2 + y^2 = 1 means that you have to work harder.

If you haven’t learned the method of Lagrange multipliers, your first instinct will probably be to try and reduce the number of variables in the problem.  For example, you could try to use the constraint equation x^2 + y^2 = 1 to solve for y in terms of x, and then plug the solution for y into the equation that you’re trying to maximize or minimize.  Then you can hope to get the maximum or minimum by taking the derivative of z with respect to your one remaining variable, x. If you try this method, however, you’ll find that it gets messy really quickly. And heaven help you if you have a problem with many variables or many constraints – you’ll have to do a whole lot of messy solving and substituting before you get the equation down to a single variable.
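To get a feel for the mess, here is a sketch of how the substitution route goes for this particular problem. The constraint gives two branches for y, and each one has to be handled separately:

```latex
y = \pm\sqrt{1 - x^2}
\quad\Longrightarrow\quad
z(x) = -2x \pm \sqrt{1 - x^2},
\qquad
\frac{dz}{dx} = -2 \mp \frac{x}{\sqrt{1 - x^2}} = 0
```

Squaring to clear the square root gives x^2 = 4/5 on both branches, but you then have to go back and check which sign of x actually satisfies each equation (x is negative on the upper branch and positive on the lower one). It works here, but only because there are just two variables and one constraint.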

The key idea behind the method of Lagrange multipliers is that, instead of trying to reduce the number of variables, you increase the number of variables by adding a set of unknown constants (called Lagrange multipliers).  What you get in exchange for increasing the number of variables, however, is a new function (commonly denoted Λ), for which all the variables are independent.  With this magic new function you can do the optimization simply by taking the derivative of Λ with respect to each variable one at a time.  This function (called the Lagrange function) is:

\Lambda(x, y, ..., \lambda_1, \lambda_2, ...) = (\textrm{function you're trying to optimize}) - \lambda_1 (\textrm{first constraint equation}) - \lambda_2 (\textrm{second constraint equation}) - ...

[Here when I write “constraint equation”, I really mean “the left-hand side of a constraint equation, written so that the right-hand side is zero”.]  You can find the maximum or minimum of this function by setting all of its derivatives to zero:

\frac{\partial \Lambda}{\partial x} = \frac{\partial \Lambda}{\partial y} = ... = \frac{\partial \Lambda}{\partial \lambda_1} = \frac{\partial \Lambda}{\partial \lambda_2} = ... = 0

So in our example problem, the Lagrange function is

\Lambda = -2x + y - \lambda(x^2 + y^2 - 1).

The first part, -2x + y, is the function z that we’re trying to maximize/minimize, and the part in parentheses, (x^2 + y^2 - 1), is the constraint.  The three equations that come from taking the derivatives of \Lambda are

\frac{\partial \Lambda}{\partial x} = -2 -2 \lambda x = 0

\frac{\partial \Lambda}{\partial y} = 1 - 2 \lambda y = 0

\frac{\partial \Lambda}{\partial \lambda} = x^2 + y^2 - 1 = 0.

This last equation is just a repetition of the constraint equation, but the other two are really useful.  You can manipulate them pretty easily to find that

x = -1/\lambda,    y = 1/(2 \lambda)

Using the constraint equation allows you to solve for \lambda, and after a relatively painless bit of plugging and chugging you’ll arrive at two solutions:

x = -2/\sqrt{5},    y = 1/\sqrt{5},     z = \sqrt{5}

x = 2/\sqrt{5},    y = -1/\sqrt{5},    z = -\sqrt{5}.

These are the maximum and the minimum that we’re looking for.
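For anyone who wants the plugging and chugging spelled out, substituting x = -1/\lambda and y = 1/(2 \lambda) into the constraint gives:

```latex
x^2 + y^2 = \frac{1}{\lambda^2} + \frac{1}{4\lambda^2} = \frac{5}{4\lambda^2} = 1
\quad\Longrightarrow\quad
\lambda = \pm\frac{\sqrt{5}}{2}

\lambda = +\frac{\sqrt{5}}{2}: \quad
x = -\frac{2}{\sqrt{5}}, \quad y = \frac{1}{\sqrt{5}}, \quad
z = -2x + y = \frac{4}{\sqrt{5}} + \frac{1}{\sqrt{5}} = \sqrt{5}

\lambda = -\frac{\sqrt{5}}{2}: \quad
x = \frac{2}{\sqrt{5}}, \quad y = -\frac{1}{\sqrt{5}}, \quad
z = -\sqrt{5}
```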

Not bad, right?
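If you'd like to convince yourself that these really are the extremes, one option is a brute-force check: walk around the circle in small steps and record the highest and lowest points. The sketch below does that in plain Python. The step count and the function name `height` are my own choices for illustration.

```python
import math

def height(theta):
    """Height z = -2x + y at the point (cos theta, sin theta) on the unit circle."""
    x, y = math.cos(theta), math.sin(theta)
    return -2.0 * x + y

# Sample the circle finely (the resolution is an arbitrary choice).
n_steps = 100_000
thetas = [2.0 * math.pi * k / n_steps for k in range(n_steps)]

# Brute-force search for the highest and lowest sampled points.
theta_max = max(thetas, key=height)
theta_min = min(thetas, key=height)

x_max, y_max = math.cos(theta_max), math.sin(theta_max)
x_min, y_min = math.cos(theta_min), math.sin(theta_min)

print("highest:", x_max, y_max, height(theta_max))
print("lowest: ", x_min, y_min, height(theta_min))
print("sqrt(5) =", math.sqrt(5.0))
```

The printed heights should agree with ±\sqrt{5} up to the resolution of the sampling, and the coordinates should sit close to (∓2/\sqrt{5}, ±1/\sqrt{5}).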


The real power of Lagrange multipliers

What’s really great about Lagrange multipliers is not that they can solve rinky-dink little problems like the one above, where you’re looking for the best point on some function.  What’s amazing is that Lagrange multipliers can find you an optimal function.

Let’s imagine, as an example, the following contrived problem.  Suppose that there is an outdoor, open-air rock concert, and music fans crowd around the stage to hear.  In general, the density of the crowd will be highest right next to the stage, and the density will get lower as you move away.


In choosing where to stand, the audience members have to weigh the tradeoff between their desire to be close to the band and their desire to avoid a very dense crowd.  Suppose that there is some “happiness function” that weighs both of these factors together.  For the purposes of our contrived example, let’s say it’s

h = \frac{1}{1+x} - c \rho^2.

Here, h is the happiness of a person at a distance x (in some units) from the stage, c is some constant, and \rho is the density of the crowd around them.  The term 1/(1+x)  is supposed to represent the enjoyment that people get from being close to the band, which decays as you move away, while the negative term -c \rho^2 represents a person’s discomfort at being in an extremely dense crowd.  The interesting question is: what distribution of crowd density, \rho(x), maximizes the total happiness of everyone at the concert?  In other words, what is the very best function \rho(x)?

While h(x) represents the happiness of a particular person at position x, the total happiness of everyone in the crowd is

H = \int h(x) \rho(x) dx.

That is, H is equal to the number of people \rho(x) dx in any small interval (x, x+dx) of position, multiplied by the happiness of those people, and summed over all positions.  This is the function that we will try to maximize.

The constraint on this function is that there is some fixed total number N of people in the crowd:

\int \rho(x) dx = N.

Now, using the recipe outlined above, we can write down a Lagrange function

\Lambda = H - \lambda ( \int \rho(x) dx - N).


In the previous problem, we were only trying to find optimal values of two specific variables: x and y.  Here, we are trying to find the optimal value of \rho(x) at every value of x.  So you can think that our goal is to optimize the function H with respect to infinitely many variables: one value of \rho for every possible position.  Beyond that conceptual generalization, however, the recipe for solving the problem is the same.  If it helps, you can imagine dividing up the set of all possible positions into discrete points: x_1, x_2, x_3, etc.  Each position x_i has a corresponding value of \rho_i and a corresponding value of the local happiness function h_i = 1/(1+x_i) - c \rho_i^2.  The function to be optimized is then just

H = h_1 \rho_1 + h_2 \rho_2 + ...

while the constraint condition is

\rho_1 + \rho_2 + ... = N.

The optimality of the Lagrange function says that

\frac{\partial \Lambda}{\partial \rho_1} = \frac{\partial \Lambda}{\partial \rho_2} = ... = 0

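Let’s consider some particular point x_i, with density \rho_i.  Since \rho_i appears only in the term h_i \rho_i of H and once in the constraint sum, the Lagrange equation

\frac{\partial \Lambda}{\partial \rho_i} = 0

becomes

\frac{\partial}{\partial \rho_i} (h_i \rho_i) - \lambda = 0,

or, using h_i \rho_i = \rho_i/(1 + x_i) - c \rho_i^3,

\frac{1}{1 + x_i} - 3 c \rho_i^2 - \lambda = 0.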

Drop the subscript i, and you’ll see that this equation is actually telling you about the functional dependence of the density \rho on the position x.  In particular, solving for \rho gives

\rho(x) = \sqrt{ \frac{1}{3c} ( \frac{1}{1 + x} - \lambda) }.
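One way to check this result is to go back to the discrete picture and test it numerically. The sketch below picks a handful of positions, fills in the densities from the formula, and then tries moving a few people from one position to another (which keeps the total number fixed). The values of `c`, `lam`, `eps` and the list of positions are assumptions chosen purely for illustration.

```python
import math

c = 1.0     # crowding constant (assumed value)
lam = 0.2   # Lagrange multiplier (assumed value; in the full problem N fixes it)
xs = [0.0, 0.5, 1.0, 2.0, 3.0]  # discrete positions, all with 1/(1+x) > lam

def rho_formula(x):
    """Density predicted by the Lagrange condition at position x."""
    return math.sqrt((1.0 / (1.0 + x) - lam) / (3.0 * c))

def total_happiness(rhos):
    """Discrete H = h_1 rho_1 + h_2 rho_2 + ... with h_i = 1/(1+x_i) - c rho_i^2."""
    return sum((1.0 / (1.0 + x) - c * r * r) * r for x, r in zip(xs, rhos))

rhos = [rho_formula(x) for x in xs]
H_best = total_happiness(rhos)

# d(h_i rho_i)/d(rho_i) should equal lam at every position.
marginals = [1.0 / (1.0 + x) - 3.0 * c * r * r for x, r in zip(xs, rhos)]

# Move a small amount of density from position j to position i.
# The sum of the densities is unchanged, so the constraint still holds.
eps = 0.01
shifted_totals = []
for i in range(len(xs)):
    for j in range(len(xs)):
        if i != j:
            trial = list(rhos)
            trial[i] += eps
            trial[j] -= eps
            shifted_totals.append(total_happiness(trial))

print("marginal happiness at each position:", marginals)
print("H at the formula's densities:", H_best)
print("best H after any single shift:", max(shifted_totals))
```

Every marginal value should come out equal to `lam`, and every shifted configuration should have a lower total happiness than the one given by the formula. That is the discrete version of the statement that the formula is the optimum.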

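The value of \lambda depends on the number of people N in the crowd – a larger crowd means \lambda gets closer to zero.  To see why, notice that the density drops to zero where 1/(1+x) = \lambda, so the crowd ends at a distance x = 1/\lambda - 1 from the stage; the smaller \lambda is, the farther the crowd extends and the more people it contains.  You can go back and solve for the value of \lambda by doing the integral of \rho(x) and setting it equal to N, but in the interest of not being too pedantic I’ll spare you the details.   The final solution for \rho(x) is highest right next to the stage and falls steadily to zero at the edge of the crowd.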
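To make the dependence of \lambda on N concrete, here is a minimal numerical sketch. It integrates \rho(x) from the stage out to the point where the density vanishes, and then uses bisection to find the \lambda that gives a chosen crowd size. The function names, the value c = 1, the integration resolution and the bracket for \lambda are all my own assumptions, not part of the derivation above.

```python
import math

def crowd_size(lam, c=1.0, n_steps=5000):
    """Integrate rho(x) from x = 0 to the crowd's edge with the midpoint rule."""
    x_edge = 1.0 / lam - 1.0  # the density vanishes where 1/(1+x) = lam
    dx = x_edge / n_steps
    total = 0.0
    for k in range(n_steps):
        x = (k + 0.5) * dx
        # max(0, ...) guards against tiny negative values from rounding
        total += math.sqrt(max(0.0, (1.0 / (1.0 + x) - lam) / (3.0 * c))) * dx
    return total

def solve_lambda(n_people, c=1.0, tol=1e-8):
    """Bisection on lam: the crowd size shrinks as lam grows toward 1."""
    lo, hi = 1e-6, 1.0  # assumed bracket; lam = 1 corresponds to an empty crowd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if crowd_size(mid, c) > n_people:
            lo = mid  # too many people, so lam must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_small = solve_lambda(1.0)
lam_medium = solve_lambda(5.0)
lam_large = solve_lambda(20.0)

for n, lam in [(1.0, lam_small), (5.0, lam_medium), (20.0, lam_large)]:
    print("N =", n, " lambda =", lam, " crowd edge at x =", 1.0 / lam - 1.0)
```

The values of \lambda should shrink toward zero as N grows, while the edge of the crowd moves farther from the stage.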



The takeaway from this funny exercise is that Lagrange multipliers allow you to solve not just for the optimal point on some function, but for the optimal kind of function for some environment.  This is the kind of problem that I didn’t even realize was well-posed until I got to graduate school, and the ability to solve such problems is an extremely powerful tool in physics.  Indeed, it is one of the recurring themes of physics that when we want to know which laws govern nature, we start by asking “which laws would give the smallest (or largest) total amount of X?”

When it comes to asking those kinds of questions, Lagrange multipliers are like a math superpower.



A couple people have commented (on Twitter) that there is a simple pictorial way to think about Lagrange multipliers and why they work, and there’s no reason for me to make them seem like black magic.  This is true, of course, so let me try and give a quick recap of the intuitive explanation for the method.

Consider the first example in this post, where you are constrained to move along the circle x^2 + y^2 = 1.  Imagine an arrow pointing in the direction of your motion as you walk around the circle.  And now imagine also an arrow that represents the gradient of the function f you are trying to maximize, which in that example is the height f(x,y) = -2x + y (remember, the gradient of a function points in the direction of greatest increase of that function).  If the arrow for the direction of your motion points in the same direction as the gradient, then you are moving directly uphill.  If the arrow of your motion points in the direction opposite to the gradient, then you are moving directly downhill.

Most of the time, there will be some particular angle between the direction of your motion and the gradient.  This means you are moving “somewhat uphill” or “somewhat downhill”.  But at the very peak height (or at the very lowest point) of your trajectory, your motion will be exactly perpendicular to the gradient, meaning that for that instant you are moving neither uphill nor downhill.

The key idea is to imagine a function g(x,y) that represents the constraint; in our example, g(x,y) = x^2 + y^2 - 1.  The constraint (the definition of the circle you are constrained to walk along) represents the contour g(x,y) = 0.  The gradient of the function g(x,y) always points perpendicular to the direction of your motion along the circle, since by definition moving along the circle does not change the value of g(x,y).

So, putting all the pieces together, we arrive at the conclusion that at a maximum or minimum, the gradient of f points parallel to the gradient of g.

In equation form, this is

\partial_{x} f = \lambda \partial_{x} g,

\partial_{y} f = \lambda \partial_{y} g,

where \lambda is some constant.

This is exactly the Lagrange multiplier equation \partial_{x} \Lambda =  \partial_{y} \Lambda = ... = 0, with \Lambda = f - \lambda g.
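As a quick sanity check of the picture, you can verify numerically that the two gradients line up at the solutions of the first example and not at a generic point on the circle. The sketch below uses the 2D cross product, which is zero exactly when two vectors are parallel. The helper names `grad_f`, `grad_g` and `cross` are mine.

```python
import math

def grad_f(x, y):
    """Gradient of f(x, y) = -2x + y."""
    return (-2.0, 1.0)

def grad_g(x, y):
    """Gradient of g(x, y) = x^2 + y^2 - 1."""
    return (2.0 * x, 2.0 * y)

def cross(a, b):
    """2D cross product; it vanishes when a and b are parallel."""
    return a[0] * b[1] - a[1] * b[0]

s = 1.0 / math.sqrt(5.0)
highest = (-2.0 * s, s)   # the maximum found in the first example
lowest = (2.0 * s, -s)    # the minimum found in the first example
generic = (1.0, 0.0)      # some other point on the circle

for name, (x, y) in [("highest", highest), ("lowest", lowest), ("generic", generic)]:
    print(name, "cross product of gradients:", cross(grad_f(x, y), grad_g(x, y)))

# At the highest point, the ratio of the x-components recovers the multiplier.
lam_at_highest = grad_f(*highest)[0] / grad_g(*highest)[0]
print("lambda at highest point:", lam_at_highest, " sqrt(5)/2 =", math.sqrt(5.0) / 2.0)
```

The cross product should vanish (up to rounding) at the highest and lowest points and be clearly nonzero at the generic point, and the ratio of the gradients at the maximum should match \lambda = \sqrt{5}/2.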

If this all still feels pretty opaque, there is a very nice video series from Khan Academy on this subject.

7 Comments
  1. July 7, 2018 11:13 pm

    Welcome back! Great to see you and thank you for the Lagrange multiplier primer. I’m anxious to see the next installments!

  2. July 8, 2018 9:44 pm

    Very nice, thank you!

  3. July 9, 2018 7:57 am

    I’ll join my fellow readers: welcome back and thanks for the nostalgic trip to my freshman year.

  4. Wihan permalink
    March 29, 2019 8:33 am

    Hi, one thing that I didn’t quite catch is “The value of \lambda depends on the number of people N in the crowd – a larger crowd means \lambda gets closer to zero. “. Do you mind expanding on this?


  1. How thick is the atmosphere? A derivation of the Boltmzann distribution | Gravity and Levity
