I’m trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness.
What is the simplest possible algebraic function that would be impossible, or at least rather difficult, to optimize by setting its first derivative to 0, but easily doable with gradient descent? I would preferably like to demonstrate this in the context of linear regression or some extremely simple machine learning model.
Go back to your history: Cauchy is the earliest person I’m aware of to have used gradient descent, and he motivated it as follows:
one ordinarily starts by reducing them to a single one by successive eliminations, to eventually solve for good the resulting equation, if possible. But it is important to observe that 1◦ in many cases, the elimination cannot be performed in any way; 2◦ the resulting equation is usually very complicated, even though the given equations are rather simple
That is, the usefulness of gradient descent is motivated when you have a rough idea of when you are close to the minimum, but you don’t want to go through the hassle of the algebra. (Realistically, if you can solve it with gradient descent, you could probably solve it algebraically; we just don’t have the same stupidly easy-to-implement computational routines for that.)
https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf
the error function
6th degree polynomial?
Here’s an example of fitting a 3rd-order polynomial that I worked with recently and that might be useful for you. The code is largely taken from the PyTorch documentation, with an animation added to illustrate the process.
https://gist.github.com/jlines/00eae9590d26b31cb98b333e537fcb31
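In case the link goes stale, the core of it looks roughly like this. This is a condensed sketch in the spirit of the PyTorch polynomial-fitting tutorial, not the gist’s exact code; the learning rate and iteration count are just reasonable choices:

```python
import math
import torch

# Toy data: approximate sin(x) on [-pi, pi] with a cubic a + b*x + c*x^2 + d*x^3
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Coefficients to be found by gradient descent
a, b, c, d = (torch.randn((), requires_grad=True) for _ in range(4))

lr = 1e-6
for step in range(2000):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        for p in (a, b, c, d):
            p -= lr * p.grad   # plain gradient descent step
            p.grad = None      # reset accumulated gradients

print(f"y ~ {a.item():.3f} + {b.item():.3f} x + {c.item():.3f} x^2 + {d.item():.3f} x^3")
```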
How about this function:
f(x) = abs(x) - cos(x)
Setting the first derivative equal to zero leads to trouble, as the derivative doesn’t exist at the minimum, and there are infinitely many points which have zero slope. Nevertheless, the function has a clear minimum, which gradient descent with any finite step size should find.
Ignoring the fact that you can’t prove anything about gradient descent for nonsmooth functions (you need subgradients), a finite step size will NOT work: you can end up at the minimizer, compute a subgradient there that is different from zero, and move AWAY from it. You need to decay your step size to ensure that you end up at the minimizer (asymptotically).
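A small numerical illustration of that point, with f(x) = abs(x) - cos(x) from above (the starting point, step sizes, and iteration count are arbitrary choices; note that at x = 0 the code deliberately picks the subgradient 1, which is nonzero):

```python
import math

def subgrad(x):
    # One valid subgradient of f(x) = abs(x) - cos(x); at x = 0 we pick 1.0 out of [-1, 1]
    return (1.0 if x >= 0 else -1.0) + math.sin(x)

# Fixed step: the iterates keep hopping back and forth around the minimizer x* = 0
x = 2.0
for k in range(1000):
    x -= 0.1 * subgrad(x)
print("fixed step, final x =", x)       # stays an O(step) distance away from 0

# Decaying step: the oscillation shrinks with the step, so the iterates approach x* = 0
x = 2.0
for k in range(1000):
    x -= 0.1 / math.sqrt(k + 1) * subgrad(x)
print("decaying step, final x =", x)    # much closer to 0
```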
Except you could get stuck at an arbitrary saddle point.
Isn’t the probability of landing exactly on one of those points zero, though?
Of course, every locally Lipschitz function is differentiable almost everywhere, i.e., the set of points where it is not differentiable has measure zero. So if you drew a point randomly (with a distribution absolutely continuous w.r.t. the Lebesgue measure), then you would avoid such points with probability one. However, this has NOTHING to do with actually performing an algorithm like gradient descent - in these algorithms you are NOT simply drawing points randomly, and you cannot guarantee a priori that you avoid these points.
It does answer OP’s question, but is of limited practical relevance for an ML course IMHO.
We typically use GD in approximately pseudoconvex optimization landscapes; the motivation is not that there are infinitely many (or even any) saddle points. To escape local optima and saddle points, we rely on other tricks like SGD.
Linear regression with an L1 regularization term is a great example. While in some cases a closed form exists (if we only have a single predictor, or if X is orthonormal), generally gradient descent is used.
To be precise: linear regression with L2 loss and L1 regularization. L1 regression with L1 penalty can be solved using a linear program.
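A minimal sketch of that setup with plain (sub)gradient descent; the synthetic data, penalty weight, and step size below are arbitrary, and the sign term is one valid subgradient of the L1 penalty (a proximal method or coordinate descent would give exact zeros, this won’t):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 10 features, only the first 3 coefficients are nonzero
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 0.1            # L1 penalty weight
w = np.zeros(10)
n = len(y)

for k in range(5000):
    # Subgradient of (1/(2n)) * ||Xw - y||^2 + lam * ||w||_1
    g = X.T @ (X @ w - y) / n + lam * np.sign(w)
    w -= 0.01 * g    # fixed small step; decaying it would sharpen the final accuracy

print(np.round(w, 2))   # close to w_true, shrunk slightly toward zero by the penalty
```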
A simple example is when you have a set of n points, and you want to find a local maximum of “distance away from the closest point in the set”. The gradient is easy to approximate, and you can show students visually what’s happening. There are algorithms to solve this exactly, but they are much more complicated than gradient descent.
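A rough sketch of what that could look like in 2-D (the point set, step size, and iteration count are arbitrary; the objective is the distance to the nearest point, and we take ascent steps along its gradient):

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0, 1, size=(20, 2))      # the fixed point set

def nearest(x):
    d = np.linalg.norm(points - x, axis=1)
    i = int(np.argmin(d))
    return points[i], d[i]

x = np.array([0.5, 0.5])
for k in range(300):
    p, d = nearest(x)
    grad = (x - p) / d                        # gradient of ||x - p|| w.r.t. x
    x = np.clip(x + 0.01 * grad, 0.0, 1.0)    # ascent step, kept inside the unit square

print("local maximum near", x, "with nearest-point distance", nearest(x)[1])
```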
For one step more complexity you could do an SVM. I.e., divide points into red and blue, and find the line that separates them into two groups, maximizing the margin to the closest point. This is a real problem for which gradient descent (well, some variant of it) is the best solution. It’s also nicely visualizable.
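For the SVM version, a minimal sketch with a soft-margin linear SVM trained by subgradient descent on the regularized hinge loss (the toy blobs, regularization weight, and learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two separable 2-D blobs: "red" near (-2, -2) with label -1, "blue" near (2, 2) with label +1
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
n = len(y)

# Minimize  lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (x_i . w + b))
lam, lr = 0.01, 0.1
w, b = np.zeros(2), 0.0

for k in range(2000):
    margins = y * (X @ w + b)
    active = margins < 1     # points with margin < 1 (inside the margin or misclassified) contribute
    gw = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
    gb = -y[active].sum() / n
    w -= lr * gw
    b -= lr * gb

acc = ((X @ w + b) * y > 0).mean()
print("w =", w, "b =", b, "training accuracy =", acc)
```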
Logistic Regression is a simple machine learning model with no closed form solution.
Why is this not the most upvoted answer?
You can write a log likelihood function and take its derivative and set it equal to zero.
No; to convince yourself, just try to do it. Setting the derivative of the log-likelihood to zero gives transcendental equations in the coefficients, with no closed-form solution in general.
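For concreteness: the score equations involve the coefficients inside sigmoids, so there is no closed form, but gradient ascent on the log-likelihood is a few lines. A minimal sketch (toy data, step size, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy binary data generated from a known logistic model
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)

# Gradient ascent on the mean log-likelihood: its gradient is X^T (y - p) / n
w = np.zeros(3)
lr = 0.5
for k in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))
    w += lr * X.T @ (y - p) / len(y)

print("estimated w:", np.round(w, 2), " true w:", w_true)
```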
I do not understand most comments here.
Gradient descent is just the same as the tangent method. It is ubiquitously used, e.g. to find the minimum of any polynomial of degree >= 4.
Calculating the derivative and finding a zero of the derivative is still the same problem: you look for a numerical solution, using gradient descent. Bisection is slower and not as effective for multivariate functions.
I would say the opposite: there are more optimisation problems where gradient descent is used than not (excluding everything which can be solved by linear systems)
This reminds me that I still haven’t read the paper on the forward-forward algorithm, and thus I’m not even sure whether it’s still gradient descent.
Give an example of linear regression where the zero-conditional-mean assumption does not hold, like an autoregressive process. You cannot just take the derivative and set it to zero then, because it is not differentiable. Gradient descent, however, doesn’t care about any of that.
abs(x)
Abs is an example of a function for which gradient descent does not work: no matter how small the step, it oscillates near zero.
Curious about this answer. I use AI to help me understand a lot of what happens with questions like these, but as it’s not my area of expertise, I am always suspicious of the answers I get. So I sent your question to Phind:
A good example of a function that is difficult to optimize by setting its first derivative to zero, but can be easily optimized using gradient descent, is the function f(x) = x*ln(x) - sqrt(x)
The derivative of this function is f’(x) = ln(x) + 1 - 1/(2*sqrt(x)). Setting this equal to zero does not yield a solution that can be expressed in terms of elementary functions. However, we can easily apply the gradient descent method to find a numerical solution.
*********************************
It also generated a bunch of example code in Python. Curious as to the validity of the response.
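The derivative in the quoted answer is correct, and I don’t see an elementary closed form for its root either. The core of such generated code presumably looks something like this minimal sketch (the starting point and step size are my own guesses, not Phind’s output); it settles near x ≈ 0.68:

```python
import math

def f(x):
    return x * math.log(x) - math.sqrt(x)

def df(x):
    return math.log(x) + 1 - 1 / (2 * math.sqrt(x))

x = 1.0        # starting point; must stay inside the domain x > 0
lr = 0.1
for k in range(200):
    x -= lr * df(x)

print(f"minimum near x = {x:.4f}, f(x) = {f(x):.4f}")
```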
Find the minimum or maximum of an n-dimensional matrix of data instead of a function. Generate the data with any nice-to-visualise function. Then move on from gradient descent to simulated annealing.
Functions that are an exponential plus a polynomial, or a trigonometric function plus a polynomial, in general do not have analytical solutions for finding their roots.
So you could use "minimize f(x) = e^x + x^2" as an example.
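A minimal sketch of that example (starting point and step size are arbitrary): f'(x) = e^x + 2x = 0 has no elementary closed form (the root can be written with the Lambert W function, x ≈ -0.3517), but gradient descent finds it in a handful of iterations.

```python
import math

def df(x):
    # derivative of f(x) = e^x + x^2
    return math.exp(x) + 2 * x

x = 0.0
lr = 0.1
for k in range(200):
    x -= lr * df(x)

print(f"minimizer x = {x:.4f}")   # about -0.3517
```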