There are a variety of interesting functions that can home in on specific characteristics and weaknesses of an optimization method, without the need for hand-crafted inputs or other datasets. Wikipedia has a selection of several such functions here:
https://www.wikiwand.com/en/Test_functions_for_optimization
One of them is the Himmelblau function. It is interesting in that it has four minima, which also happen to be roots; which of them an optimization method falls into depends on the method, its convergence properties and its initialization.
https://www.wikiwand.com/en/Himmelblau%27s_function
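For reference, the function is f(x, y) = (x² + y − 11)² + (x + y² − 7)², and according to the Wikipedia article its four minima, all with value zero, sit at (3, 2), (−2.805118, 3.131312), (−3.779310, −3.283186) and (3.584428, −1.848126).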
Through the many available packages for scientific computing in general and machine learning in particular, it takes little effort and code to run many popular optimization methods, root-finding approaches, and a host of other mathematical algorithms. A lot of well-known ones have made their way into the Python package scipy. Notationally, using R is similarly convenient, at times even more so.
To get into the topic, the Nelder-Mead method is relatively intuitive and easy to understand:
https://www.wikiwand.com/en/Nelder%E2%80%93Mead_method
The article uses terms like "polytope" or "simplex" to describe its process, but it's easier to look at it in the plane the first time and illustrate the procedure to oneself there. Then, roughly speaking, all it does is pick three points and successively eliminate the worst of them, in terms of the objective function being optimized, by replacing that point with its reflection through the midpoint of the other two. That's very rough.
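To make that a little more concrete, here is a tiny, self-contained sketch of just that reflection step in the plane, assuming plain numpy (the real method also expands, contracts and shrinks the triangle, none of which is shown here):

import numpy as np

def reflect_worst(triangle, g):
    # Order the three vertices from best to worst by objective value.
    best, good, worst = sorted(triangle, key=g)
    centroid = (np.asarray(best) + np.asarray(good)) / 2
    # Mirror the worst vertex through the midpoint of the two better ones.
    reflected = 2 * centroid - np.asarray(worst)
    return [best, good, reflected]

g = lambda p: (p[0] - 1)**2 + (p[1] - 2)**2   # simple bowl with its minimum at (1, 2)
triangle = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(reflect_worst(triangle, g))             # the worst vertex (0, 0) gets replaced by (1, 1)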
There is an entire module within scipy for optimization. In particular, we are interested in the minimize function:
from scipy.optimize import minimize
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
It provides more than a dozen optimization methods, among them Nelder-Mead, Powell and Conjugate Gradient. It's very easy to call this function once you've seen an example at work, and Python's compact lambda notation makes it quite pleasant to pass in an objective. Let's minimize something simple, like this parabola, moved one unit right and two up:
minimize(lambda x:(x-1)**2+2,0)
Output:
fun: 2.0
hess_inv: array([[0.4999999973]])
jac: array([0.])
message: 'Optimization terminated successfully.'
nfev: 9
nit: 2
njev: 3
status: 0
success: True
x: array([0.9999999947])
It found the minimum we expect, at x≈1, f(x)=2. There is also some information about the Hessian and the Jacobian in the output, but it was not necessary to pass any of that by hand; it was inferred from the function through numerical evaluation. The value of "nit" means that we arrived at the optimum in just two iterations from our initial value of zero. It's also possible to just pick a random value from the optimization domain, but that gets much more interesting with more complex examples.
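For example, something like this (a throwaway sketch; the interval [-10, 10] is an arbitrary choice here):

import numpy as np
minimize(lambda x: (x - 1)**2 + 2, np.random.uniform(-10, 10))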
Going by the Wikipedia article, swapping this for the Himmelblau function is very easy:
f = lambda x: (x[0]**2+x[1]-11)**2+(x[0]+x[1]**2-7)**2
The only slight inconvenience in going from one to two parameters is that the function argument now needs to be indexed like a list, but some might even prefer writing it this way, as it looks closer to a parameter vector with indexed components.
Let's run a few different optimization approaches, all at once:
[(m,minimize(f, [0,0], method=m)) for m in ['Nelder-Mead','CG','TNC']]
Output:
[('Nelder-Mead',
final_simplex: (array([[3.0000063249, 1.9999685321],
[2.999988756 , 2.0000500264],
[2.9999820172, 1.9999701156]]), array([0.0000000143, 0.000000036 , 0.0000000379]))
fun: 1.4333178246209053e-08
message: 'Optimization terminated successfully.'
nfev: 157
nit: 81
status: 0
success: True
x: array([3.0000063249, 1.9999685321])),
('CG',
fun: 2.499158120290287e-13
jac: array([-0.0000055068, -0.0000017166])
message: 'Optimization terminated successfully.'
nfev: 88
nit: 9
njev: 22
status: 0
success: True
x: array([2.9999999213, 1.9999999884])),
('TNC',
fun: 3.750719350416987e-14
jac: array([ 0.000000957 , -0.0000010897])
message: 'Converged (|f_n-f_(n-1)| ~= 0)'
nfev: 31
nit: 9
status: 1
success: True
x: array([3.0000000213, 1.9999999504]))]
To various degrees of precision, they all find one of the four minima of the Himmelblau function, at (3, 2), where the function has a root, i.e. is zero. If instead a maximum is sought, it's simple to negate the function by putting a minus sign in front and then flipping the sign of the function value it locates. The output above also contains some other information, such as the number of iterations used.
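As a quick sketch of that negation trick, hunting for the Himmelblau function's single local maximum (roughly at (-0.27, -0.92) per the Wikipedia article) could look like this; note that far from the center -f is unbounded below, so an unlucky starting point could wander off instead:

# Minimize the negated function and flip the sign of the value it reports.
res = minimize(lambda x: -f(x), [0, 0], method='Nelder-Mead')
res.x, -res.fun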
There are other neat test functions, some with many more local minima, such as the Rastrigin function.
https://www.wikiwand.com/en/Rastrigin_function
In such cases, there are far more recent and creative approaches, such as Particle Swarm Optimization.
https://www.wikiwand.com/en/Particle_swarm_optimization
There appears to be no ready implementation in scipy, but there is this Python package:
https://pythonhosted.org/pyswarm/
By sending out not just a few points, as in the case of Nelder-Mead, but many "boid"-like units/creatures, a wider range of the search space can be covered initially, and it becomes less likely that the optimization method falls into one of the (bad) local minima through an unlucky initialization. People who have played the first Half-Life might also remember swarms of alien creatures flying about in the final stage of the game, Xen. Those were also called "boids", and it seems some of the developers might have taken inspiration from a paper on this, dating all the way back to 1987 (I don't remember whether the term "boid" was already used back then, although I believe it was used in the Half-Life source code):
Reynolds, Craig W. Flocks, Herds and Schools: A Distributed Behavioral Model. Vol. 21. 4. ACM, 1987.
https://www.researchgate.net/profile/Craig_Reynolds2/publication/2797343_Flocks_Herds_and_Schools_A_Distributed_Behavioral_Model/links/0fcfd5095a869204df000000.pdf
https://half-life.fandom.com/wiki/Boid
(I understand the two don't really have anything to do with each other thematically, but we're supposed to have fun in this place, no? I want to imagine that in Particle Swarm Optimization there are a bunch of those Half-Life critters flying around the optimization space, and you can't stop me. Better yet, next time you read anything about search spaces you might have caught the brain-bug too, and will be thinking of this.)
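Coming back to the optimization itself: to illustrate the "swarm of candidate points" idea described above, here is a minimal, stripped-down global-best PSO sketch, applied to the Himmelblau function f from above. This is not the pyswarm implementation, and the coefficients w, c1 and c2 below are just common textbook defaults:

import numpy as np

def simple_pso(f, lb, ub, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    x = rng.uniform(lb, ub, size=(n_particles, dim))   # particle positions
    v = np.zeros_like(x)                               # particle velocities
    pbest = x.copy()                                   # each particle's best position so far
    pbest_val = np.array([f(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()           # best position of the whole swarm
    gbest_val = pbest_val.min()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        # Pull each particle toward its own best and toward the swarm's best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lb, ub)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        if pbest_val.min() < gbest_val:
            gbest, gbest_val = pbest[pbest_val.argmin()].copy(), pbest_val.min()
    return gbest, gbest_val

print(simple_pso(f, [-10, -10], [10, 10]))   # should land on (or very near) one of the four minima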
Now, the pyswarm documentation contains an even more interesting example, one that illustrates a pretty believable scenario of how one could use such optimization methods in engineering:
https://pythonhosted.org/pyswarm/#an-engineering-example
The documentation also contains this other example, using yet another of the test functions from above, the so-called "banana function", which is also more formally known as the Rosenbrock function.
https://www.wikiwand.com/en/Rosenbrock_function
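As a quick sanity check on the name: expanding (x2 − x1²)² + (x1 − 1)² + 4 gives x1⁴ − 2·x2·x1² + x2² + x1² − 2·x1 + 5, which is exactly the banana function defined below, i.e. a Rosenbrock-style curved valley (with the usual coefficient of 100 in front of the squared term replaced by 1, plus a constant offset of 4) whose unconstrained minimum of 4 sits at (1, 1).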
This is another good test case to compare against simpler gradient-based methods, because that function's long, narrow, curved valley can easily slow or even stall some gradient descent approaches (which is where second-order optimization can help, among other things):
from pyswarm import pso

def banana(x):
    # objective: the Rosenbrock-style "banana" valley from the pyswarm docs
    x1 = x[0]
    x2 = x[1]
    return x1**4 - 2*x2*x1**2 + x2**2 + x1**2 - 2*x1 + 5

def con(x):
    # inequality constraint; pyswarm expects each returned element to be >= 0 at a feasible point
    x1 = x[0]
    x2 = x[1]
    return [-(x1 + 0.25)**2 + 0.75*x2]

lb = [-3, -1]   # lower bounds of the search box
ub = [2, 6]     # upper bounds

xopt, fopt = pso(banana, lb, ub, f_ieqcons=con)
Let's try it on our Himmelblau function, too:
pso(f, [-10,-10], [10,10])
Stopping search: Swarm best objective change less than 1e-08
(array([3.0000147429, 1.9999783645]), 9.620201330693181e-09)
Again, we arrive at (3, 2). But let's run it a few more times. Since we send out a swarm of particles looking for minima, and the four minima are all equally good, different runs will land in different ones, such as the following:
(array([ 3.5844650823, -1.8481161356]), 7.496411272811221e-08)
Verify by plugging the values back in:
xopt, fopt = pso(f, [-10,-10], [10,10])
xopt, fopt, f(xopt)
(array([-3.7792845138, -3.2831597675]),
4.978451430217623e-08,
4.978451430217623e-08)
xopt, fopt = pso(f, [-10,-10], [10,10])
xopt, fopt, f(xopt)
(array([2.9999975639, 2.0000102583]),
1.5087482584245445e-09,
1.5087482584245445e-09)
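One could also just run the swarm a handful of times in a loop and round the positions, to see which of the four basins each run settles into. Again just a throwaway sketch, and the outcomes will of course differ between runs:

import numpy as np
for _ in range(5):
    xopt, fopt = pso(f, [-10, -10], [10, 10])
    print(np.round(xopt, 3), fopt)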
Pretty cool, I would say.
This was my ML snippet. I wrote some related code today and thought I might share a few auxiliary methods I did not end up using for a post. I hope it either contained a few interesting nuggets, or that we can at least sit together in appreciation of how convenient life has been made for us by the many scientific Python/R packages out there.
Other sources:
https://www.microsoft.com/en-us/research/video/2nd-order-optimization-neural-network-training/
Martens, James. “Deep Learning via Hessian-Free Optimization.” In ICML, 27:735–742, 2010.
https://www.youtube.com/watch?v=xKuT8f7gSXA
"Gordon Freeman in the flesh. Or rather, in the hazard suit. I took the liberty of relieving you of your weapons; most of them were government property. As for the suit, I think you’ve earned it."
RIP Freeman Dyson
Thank you for the source, Axel Gembe.
Schmidhuber is right, Hinton is wrong.