The Newton-Muon Optimizer
Summary
The paper introduces Newton-Muon, a variant of the Muon optimizer derived from a quadratic surrogate model that approximates the loss function of large language models (LLMs). Newton-Muon aims to improve optimization efficiency by incorporating the geometry of the input data, and the paper reports empirical advantages over standard Muon.
The Newton-Muon update of a weight matrix requires an estimate of the unknown displacement second moment ΣW. The paper introduces the isotropic proxy ΣW ∝ Im as a coarse approximation of ΣW, one that is particularly useful at initialization and in the early stages of training.
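As an illustration of how input geometry could enter a Muon-style update, the following minimal sketch contrasts an orthogonalized gradient direction with one that is first right-preconditioned by the inverse input second moment. The preconditioned form, the function names, the shapes (G as d_out × d_in, Z as d_in × n) and the ridge term are all assumptions made for illustration. The summary does not state the exact Newton-Muon update rule, and the sketch does not model ΣW.

```python
import numpy as np

def msgn(X):
    """Matrix sign of X: U @ Vt, taken from the compact SVD X = U S Vt."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def muon_direction(G):
    """Muon-style direction: orthogonalize the gradient matrix G."""
    return msgn(G)

def preconditioned_direction(G, Z, ridge=1e-6):
    """Hypothetical input-aware direction (an assumption, not the paper's rule).

    G: gradient, shape (d_out, d_in); Z: stacked layer inputs, shape (d_in, n).
    The gradient is right-multiplied by the inverse of the (ridge-regularized)
    input second moment before orthogonalization.
    """
    d_in, n = Z.shape
    S = Z @ Z.T / n + ridge * np.eye(d_in)   # symmetric positive definite
    # Solve X @ S = G for X without forming S^{-1}: S @ X.T = G.T since S = S.T
    X = np.linalg.solve(S, G.T).T
    return msgn(X)

# Usage: apply one step with a hypothetical learning rate lr
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
G = rng.standard_normal((3, 5))
Z = rng.standard_normal((5, 20))
lr = 0.02
W_next = W - lr * preconditioned_direction(G, Z)
```

In this sketch the two directions coincide when the input second moment is a multiple of the identity, so any difference between them comes entirely from anisotropy in the inputs.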
On a short-track task with a GPT-2 architecture, Newton-Muon achieves a lower validation loss and a shorter training time than the other methods compared. Experiments on short- and medium-track configurations further show it to be effective relative to AdamW, particularly in stabilizing computation and in the scores it obtains in the weight-matrix analysis.
The study introduces a triplet quadratic surrogate model that gives a local second-order view of Muon and motivates Newton-Muon as a new optimizer. Empirical results on benchmark configurations show that Newton-Muon reaches the target validation loss more efficiently. Suggested future work includes structured approximations to the Hessian matrix and further refinement of the optimization strategy.
The paper also examines the convergence of gradient descent, Newton-Muon, and Muon, asking whether each reaches specified target parameter values within a fixed number of iterations. In addition, it discusses how to compare the update directions produced by different optimizers, using score functions and numerical evaluation.
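One way to make such a comparison concrete is to score a candidate direction by the best decrease it can achieve on a quadratic model. The sketch below is an assumed, illustrative score and not necessarily the one used in the paper: for f(w) = ½ wᵀAw − bᵀw with gradient g at the current point, moving along −d with the optimal step size lowers f by (g·d)² / (2 dᵀAd). The matrix A, the gradient g and the function name are hypothetical.

```python
import numpy as np

def surrogate_decrease(A, g, d):
    """Largest decrease of a quadratic with Hessian A along -d, given gradient g.

    f(w - t d) = f(w) - t (g.d) + 0.5 t^2 (d^T A d), minimized at t = g.d / d^T A d.
    """
    gd = float(g @ d)
    curvature = float(d @ A @ d)
    return gd * gd / (2.0 * curvature)

# Hypothetical ill-conditioned quadratic
A = np.diag([1.0, 10.0])
g = np.array([1.0, 1.0])

gradient_dir = g                      # steepest-descent direction
newton_dir = np.linalg.solve(A, g)    # Newton direction A^{-1} g

score_gradient = surrogate_decrease(A, g, gradient_dir)  # 4 / 22, about 0.18
score_newton = surrogate_decrease(A, g, newton_dir)      # 0.55
```

Under this score the Newton direction attains the maximum possible value, ½ gᵀA⁻¹g, so any other direction can be judged by how close it comes to that bound.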