
15 posts

arXiv:2501.11638v1 Announce Type: new Abstract: Class imbalance (CI) is a longstanding problem in machine learning, slowing down training and reducing performances. Although empirical remedies exist, it is often unclear which ones work best and when, due to the lack of an overarching theory. We address a common case of imbalance, that of anomaly (or outlier) detection. We provide a theoretical framework to analyze, interpret and address CI. It is based on an exact solution of the teacher-student perceptron model, through replica theory. Within this framework, one can distinguish several sources of CI: either intrinsic, train or test imbalance. Our analysis reveals that the optimal train imbalance is generally different from 50%, with a non trivial dependence on the intrinsic imbalance, the abundance of data and on the noise in the learning. Moreover, there is a crossover between a small noise training regime where results are independent of the noise level to a high noise regime where performances quickly degrade with noise. Our results challenge some of the conventional wisdom on CI and offer practical guidelines to address it.

F. S. Pezzicoli, V. Ros, F. P. Landes, M. Baity-Jesi | 1/22/2025
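As a rough illustration of the teacher-student setting named in the abstract above, the sketch below labels Gaussian inputs with a teacher perceptron whose bias shifts the decision boundary, which makes one class rare. The function names, the Gaussian input distribution, and the use of a bias to control the imbalance are assumptions made for illustration; the paper's exact model may differ.

```python
import numpy as np
from math import erf, sqrt

def sample_teacher_data(n_samples, dim, bias, rng):
    """Label Gaussian inputs with a teacher perceptron (hypothetical setup).

    `bias` shifts the decision boundary and so sets the intrinsic imbalance.
    """
    teacher = rng.standard_normal(dim)
    teacher *= sqrt(dim) / np.linalg.norm(teacher)   # fix |teacher| = sqrt(dim)
    x = rng.standard_normal((n_samples, dim))
    field = x @ teacher / sqrt(dim)                  # unit-variance Gaussian per sample
    y = np.where(field > bias, 1, -1)                # +1 is the rare class when bias > 0
    return x, y

def expected_positive_fraction(bias):
    """P(z > bias) for a standard Gaussian z."""
    return 0.5 * (1.0 - erf(bias / sqrt(2.0)))

rng = np.random.default_rng(0)
x, y = sample_teacher_data(20000, 50, bias=1.5, rng=rng)
positive_fraction = float(np.mean(y == 1))
```

Subsampling the majority class of `(x, y)` before fitting a student would then change the train imbalance independently of the intrinsic one.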

arXiv:2410.00355v2 Announce Type: replace-cross Abstract: Finding eigenvalue distributions for a number of sparse random matrix ensembles can be reduced to solving nonlinear integral equations of the Hammerstein type. While a systematic mathematical theory of such equations exists, it has not been previously applied to sparse matrix problems. We close this gap in the literature by showing how one can employ numerical solutions of Hammerstein equations to accurately recover the spectra of adjacency matrices and Laplacians of random graphs. While our treatment focuses on random graphs for concreteness, the methodology has broad applications to more general sparse random matrices.

Pawat Akara-pipattana, Oleg Evnin | 1/14/2025
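For readers unfamiliar with Hammerstein equations, here is a minimal generic sketch of solving u(x) = g(x) + ∫ K(x, y) f(y, u(y)) dy on [0, 1] numerically. The Gauss-Legendre discretisation, the fixed-point iteration, and the toy kernel are illustrative choices, not the authors' equations or solver; the toy problem is constructed so that u(x) = x is a solution.

```python
import numpy as np

def solve_hammerstein(kernel, nonlinearity, source, n_nodes=20, n_iter=200, tol=1e-12):
    """Solve u(x) = source(x) + int_0^1 kernel(x, y) * nonlinearity(y, u(y)) dy
    by Gauss-Legendre quadrature and fixed-point (Picard) iteration."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    x = 0.5 * (nodes + 1.0)               # map nodes from [-1, 1] to [0, 1]
    w = 0.5 * weights                     # rescale weights for the new interval
    k = kernel(x[:, None], x[None, :])    # k[i, j] = K(x_i, x_j)
    g = source(x)
    u = np.zeros_like(x)
    for _ in range(n_iter):
        u_new = g + k @ (w * nonlinearity(x, u))
        converged = np.max(np.abs(u_new - u)) < tol
        u = u_new
        if converged:
            break
    return x, u

# Toy problem: K(x, y) = x * y, f(y, u) = u**2, g(x) = 3x/4, solved by u(x) = x.
x, u = solve_hammerstein(
    kernel=lambda x, y: x * y,
    nonlinearity=lambda y, u: u ** 2,
    source=lambda x: 0.75 * x,
)
max_error = float(np.max(np.abs(u - x)))
```

Plain fixed-point iteration only converges when the update is a contraction near the solution, as it is in this toy case; harder problems would typically need damping or a Newton-type solver.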

arXiv:2405.18874v2 Announce Type: replace-cross Abstract: The dot product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers. It adeptly captures semantic relationships between word pairs in sentences by computing a similarity overlap between queries and keys. In this work, we explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions to approximate ground states of quantum many-body spin Hamiltonians. Specifically, we perform numerical simulations on the two-dimensional $J_1$-$J_2$ Heisenberg model, a common benchmark in the field of quantum many-body systems on lattice. By comparing the performance of standard attention mechanisms with a simplified version that excludes queries and keys, relying solely on positions, we achieve competitive results while reducing computational cost and parameter usage. Furthermore, through the analysis of the attention maps generated by standard attention mechanisms, we show that the attention weights become effectively input-independent at the end of the optimization. We support the numerical results with analytical calculations, providing physical insights of why queries and keys should be, in principle, omitted from the attention mechanism when studying large systems.

Riccardo Rende, Luciano Loris Viteritti | 1/14/2025
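The contrast drawn in the abstract above, between standard dot-product attention and a variant that drops queries and keys, can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the learned position table `pos_logits`, the softmax normalisation of that table, and all shapes are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(x, w_q, w_k, w_v):
    """Standard attention: weights depend on the input through queries and keys."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def positional_attention(x, pos_logits, w_v):
    """Simplified attention: weights come from an (n, n) table over positions only."""
    weights = softmax(pos_logits)
    return weights @ (x @ w_v), weights

rng = np.random.default_rng(0)
n, d = 6, 4                                   # sequence length, feature dimension
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
pos_logits = rng.standard_normal((n, n))      # hypothetical learned parameters
x1, x2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))

_, std_w1 = dot_product_attention(x1, w_q, w_k, w_v)
_, std_w2 = dot_product_attention(x2, w_q, w_k, w_v)
out1, pos_w1 = positional_attention(x1, pos_logits, w_v)
out2, pos_w2 = positional_attention(x2, pos_logits, w_v)
```

The positional variant removes the two projection matrices `w_q` and `w_k`, and its attention map is the same for every input, whereas the standard map changes between `x1` and `x2`.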

arXiv:2501.06427v1 Announce Type: cross Abstract: It is a folklore belief in the theory of spin glasses and disordered systems that out-of-equilibrium dynamics fail to find stable local optima exhibiting e.g. local strict convexity on physical time-scales. In the context of the Sherrington--Kirkpatrick spin glass, Behrens-Arpino-Kivva-Zdeborová and Minzer-Sah-Sawhney have recently conjectured that this obstruction may be inherent to all efficient algorithms, despite the existence of exponentially many such optima throughout the landscape. We prove this search problem exhibits strong low degree hardness for polynomial algorithms of degree $D\leq o(N)$: any such algorithm has probability $o(1)$ to output a stable local optimum. To the best of our knowledge, this is the first result to prove that even constant-degree polynomials have probability $o(1)$ to solve a random search problem without planted structure. To prove this, we develop a general-purpose enhancement of the ensemble overlap gap property, and as a byproduct improve previous results on spin glass optimization, maximum independent set, random $k$-SAT, and the Ising perceptron to strong low degree hardness. Finally for spherical spin glasses with no external field, we prove that Langevin dynamics does not find stable local optima within dimension-free time.

Brice Huang, Mark Sellke | 1/14/2025

arXiv:2501.03937v1 Announce Type: new Abstract: In this manuscript, we consider the problem of learning a flow or diffusion-based generative model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise, and lead to model collapse when the generative model is re-trained on generated synthetic data.

Hugo Cui, Cengiz Pehlevan, Yue M. Lu | 1/8/2025

arXiv:2409.07708v2 Announce Type: replace-cross Abstract: In feed-forward neural networks, dataset-free weight-initialization methods such as LeCun, Xavier (or Glorot), and He initializations have been developed. These methods randomly determine the initial values of weight parameters based on specific distributions (e.g., Gaussian or uniform distributions) without using training datasets. To the best of the authors' knowledge, such a dataset-free weight-initialization method is yet to be developed for restricted Boltzmann machines (RBMs), which are probabilistic neural networks consisting of two layers. In this study, we derive a dataset-free weight-initialization method for Bernoulli--Bernoulli RBMs based on statistical mechanical analysis. In the proposed weight-initialization method, the weight parameters are drawn from a Gaussian distribution with zero mean. The standard deviation of the Gaussian distribution is optimized based on our hypothesis that a standard deviation providing a larger layer correlation (LC) between the two layers improves the learning efficiency. The expression of the LC is derived based on a statistical mechanical analysis. The optimal value of the standard deviation corresponds to the maximum point of the LC. The proposed weight-initialization method is identical to Xavier initialization in a specific case (i.e., when the sizes of the two layers are the same, the random variables of the layers are $\{-1,1\}$-binary, and all bias parameters are zero). The validity of the proposed weight-initialization method is demonstrated in numerical experiments using a toy and real-world datasets.

Muneki Yasuda, Ryosuke Maeno, Chako Takahashi | 1/8/2025
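The abstract above states that the proposed RBM initialization coincides with Xavier initialization in a special case. For reference, a minimal sketch of Xavier (Glorot) normal initialization applied to an RBM-shaped weight matrix is shown below. It illustrates only that baseline, not the general formula derived in the paper, and the layer sizes are arbitrary.

```python
import numpy as np

def xavier_normal(n_visible, n_hidden, rng):
    """Glorot/Xavier normal init: zero-mean Gaussian with variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (n_visible + n_hidden))
    return rng.normal(0.0, std, size=(n_visible, n_hidden)), std

rng = np.random.default_rng(0)
weights, std = xavier_normal(400, 400, rng)   # equal layer sizes, as in the special case
empirical_std = float(weights.std())
```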

arXiv:2309.04522v2 Announce Type: replace Abstract: Artificial neural networks have revolutionized machine learning in recent years, but a complete theoretical framework for their learning process is still lacking. Substantial advances were achieved for wide networks, within two disparate theoretical frameworks: the Neural Tangent Kernel (NTK), which assumes linearized gradient descent dynamics, and the Bayesian Neural Network Gaussian Process (NNGP). We unify these two theories using gradient descent learning with an additional noise in an ensemble of wide deep networks. We construct an analytical theory for the network input-output function and introduce a new time-dependent Neural Dynamical Kernel (NDK) from which both NTK and NNGP kernels are derived. We identify two learning phases: a gradient-driven learning phase, dominated by loss minimization, in which the time scale is governed by the initialization variance. It is followed by a slow diffusive learning stage, where the parameters sample the solution space, with a time constant decided by the noise and the Bayesian prior variance. The two variance parameters strongly affect the performance in the two regimes, especially in sigmoidal neurons. In contrast to the exponential convergence of the mean predictor in the initial phase, the convergence to the equilibrium is more complex and may behave nonmonotonically. By characterizing the diffusive phase, our work sheds light on representational drift in the brain, explaining how neural activity changes continuously without degrading performance, either by ongoing gradient signals that synchronize the drifts of different synapses or by architectural biases that generate task-relevant information that is robust against the drift process. This work closes the gap between the NTK and NNGP theories, providing a comprehensive framework for the learning process of deep wide neural networks and for analyzing dynamics in biological circuits.

Yehonatan Avidan, Qianyi Li, Haim Sompolinsky | 1/3/2025

arXiv:2412.18290v1 Announce Type: cross Abstract: Quantum reservoir computing (QRC) has emerged as a promising paradigm for harnessing near-term quantum devices to tackle temporal machine learning tasks. Yet identifying the mechanisms that underlie enhanced performance remains challenging, particularly in many-body open systems where nonlinear interactions and dissipation intertwine in complex ways. Here, we investigate a minimal model of a driven-dissipative quantum reservoir described by two coupled Kerr-nonlinear oscillators, an experimentally realizable platform that features controllable coupling, intrinsic nonlinearity, and tunable photon loss. Using Partial Information Decomposition (PID), we examine how different dynamical regimes encode input drive signals in terms of redundancy (information shared by each oscillator) and synergy (information accessible only through their joint observation). Our key results show that, near a critical point marking a dynamical bifurcation, the system transitions from predominantly redundant to synergistic encoding. We further demonstrate that synergy amplifies short-term responsiveness, thereby enhancing immediate memory retention, whereas strong dissipation leads to more redundant encoding that supports long-term memory retention. These findings elucidate how the interplay of instability and dissipation shapes information processing in small quantum systems, providing a fine-grained, information-theoretic perspective for analyzing and designing QRC platforms.

Krai Cheamsawat, Thiparat Chotibut | 12/25/2024

arXiv:2402.16991v3 Announce Type: replace-cross Abstract: Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organized in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying compositional structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed, but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterizes the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.

Antonio Sclocchi, Alessandro Favero, Matthieu Wyart | 12/25/2024

arXiv:2412.01110v3 Announce Type: replace-cross Abstract: Statistical physics provides tools for analyzing high-dimensional problems in machine learning and theoretical neuroscience. These calculations, particularly those using the replica method, often involve lengthy derivations that can obscure physical interpretation. We give concise, non-replica derivations of several key results and highlight their underlying similarities. Specifically, we introduce a cavity approach to analyzing high-dimensional learning problems and apply it to three cases: perceptron classification of points, perceptron classification of manifolds, and kernel ridge regression. These problems share a common structure -- a bipartite system of interacting feature and datum variables -- enabling a unified analysis. For perceptron-capacity problems, we identify a symmetry that allows derivation of correct capacities through a naïve method. These results match those obtained through the replica method.

David G. Clark, Haim Sompolinsky | 12/24/2024

arXiv:2412.16249v1 Announce Type: new Abstract: Behavioral experiments on the ultimatum game (UG) reveal that we humans prefer fair acts, which contradicts the prediction made in orthodox Economics. Existing explanations, however, are mostly attributed to exogenous factors within the imitation learning framework. Here, we adopt the reinforcement learning paradigm, where individuals make their moves aiming to maximize their accumulated rewards. Specifically, we apply Q-learning to UG, where each player is assigned two Q-tables to guide decisions for the roles of proposer and responder. In a two-player scenario, fairness emerges prominently when both experiences and future rewards are appreciated. In particular, the probability of successful deals increases with higher offers, which aligns with observations in behavioral experiments. Our mechanism analysis reveals that the system undergoes two phases, eventually stabilizing into fair or rational strategies. These results are robust when the rotating role assignment is replaced by a random or fixed manner, or the scenario is extended to a latticed population. We thus conclude that the endogenous factor is sufficient to explain the emergence of fairness; exogenous factors are not needed.

Guozhong Zheng, Jiqiang Zhang, Xin Ou, Shengfeng Deng, Li Chen | 12/24/2024
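The setup described in the abstract above, two players who each hold one Q-table per role and rotate roles, can be sketched roughly as below. Only the two-tables-per-player structure and the rotating roles come from the abstract. The acceptance-threshold rule, the discretised offer levels, the state definition (a player's own previous action in that role), and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def ultimatum_payoffs(offer, threshold):
    """One round (assumed rule): the responder accepts if offer >= threshold."""
    if offer >= threshold:
        return 1.0 - offer, offer      # (proposer payoff, responder payoff)
    return 0.0, 0.0

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Standard tabular Q-learning update, applied in place."""
    target = reward + gamma * q_table[next_state].max()
    q_table[state, action] = (1.0 - alpha) * q_table[state, action] + alpha * target

def epsilon_greedy(q_table, state, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(q_table[state].argmax())

levels = np.linspace(0.0, 1.0, 5)      # hypothetical offer / threshold levels
n = len(levels)
rng = np.random.default_rng(0)
# Two Q-tables per player, one for each role.
q = [{"proposer": np.zeros((n, n)), "responder": np.zeros((n, n))} for _ in range(2)]
# Assumed state: the player's own previous action in that role.
state = [{"proposer": 0, "responder": 0} for _ in range(2)]

for t in range(1000):
    p, r = t % 2, (t + 1) % 2          # roles rotate every round
    a_p = epsilon_greedy(q[p]["proposer"], state[p]["proposer"], 0.1, rng)
    a_r = epsilon_greedy(q[r]["responder"], state[r]["responder"], 0.1, rng)
    pay_p, pay_r = ultimatum_payoffs(levels[a_p], levels[a_r])
    q_update(q[p]["proposer"], state[p]["proposer"], a_p, pay_p, a_p, 0.1, 0.9)
    q_update(q[r]["responder"], state[r]["responder"], a_r, pay_r, a_r, 0.1, 0.9)
    state[p]["proposer"], state[r]["responder"] = a_p, a_r
```

Here `alpha` weights new experience against the stored estimate and `gamma` weights future rewards, which are the two ingredients the abstract says must both be appreciated for fairness to emerge.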

arXiv:2412.16411v1 Announce Type: new Abstract: We construct a thermodynamic potential that can guide training of a generative model defined on a set of binary degrees of freedom. We argue that upon reduction in description, so as to make the generative model computationally-manageable, the potential develops multiple minima. This is mirrored by the emergence of multiple minima in the free energy proper of the generative model itself. The variety of training samples that employ N binary degrees of freedom is ordinarily much lower than the size 2^N of the full phase space. The non-represented configurations, we argue, should be thought of as comprising a high-temperature phase separated by an extensive energy gap from the configurations composing the training set. Thus, training amounts to sampling a free energy surface in the form of a library of distinct bound states, each of which breaks ergodicity. The ergodicity breaking prevents escape into the near continuum of states comprising the high-temperature phase; thus it is necessary for proper functionality. It may however have the side effect of limiting access to patterns that were underrepresented in the training set. At the same time, the ergodicity breaking within the library complicates both learning and retrieval. As a remedy, one may concurrently employ multiple generative models -- up to one model per free energy minimum.

Yang He, Vassiliy Lubchenko | 12/24/2024

arXiv:2405.18296v2 Announce Type: replace Abstract: Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setting, which we prove to be exact in high dimension. Notably, our analysis reveals how different properties of sub-populations influence bias at different timescales, showing a shifting preference of the classifier during training. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias. We empirically validate our results in more complex scenarios by training deeper networks on synthetic and real datasets, including CIFAR10, MNIST, and CelebA.

Anchit Jain, Rozhin Nobahari, Aristide Baratin, Stefano Sarao Mannelli | 12/24/2024

arXiv:2412.15461v1 Announce Type: new Abstract: We study the exploration-exploitation trade-off for large multiplayer coordination games where players strategise via Q-Learning, a common learning framework in multi-agent reinforcement learning. Q-Learning is known to have two shortcomings, namely non-convergence and potential equilibrium selection problems, when there are multiple fixed points, called Quantal Response Equilibria (QRE). Furthermore, whilst QRE have full support for finite games, it is not clear how Q-Learning behaves as the game becomes large. In this paper, we characterise the critical exploration rate that guarantees convergence to a unique fixed point, addressing the two shortcomings above. Using a generating-functional method, we show that this rate increases with the number of players and the alignment of their payoffs. For many-player coordination games with perfectly aligned payoffs, this exploration rate is roughly twice that of $p$-player zero-sum games. As for large games, we provide a structural result for QRE, which suggests that as the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space, a phenomenon we term asymptotic extinction, where a constant fraction of the actions are played with zero probability at a rate $o(1/N)$ for an $N$-action game.

Desmond Chan, Bart De Keijzer, Tobias Galla, Stefanos Leonardos, Carmine Ventre | 12/23/2024

arXiv:2406.12916v3 Announce Type: replace Abstract: An important challenge in machine learning is to predict the initial conditions under which a given neural network will be trainable. We present a method for predicting the trainable regime in parameter space for deep feedforward neural networks (DNNs) based on reconstructing the input from subsequent activation layers via a cascade of single-layer auxiliary networks. We show that a single epoch of training of the shallow cascade networks is sufficient to predict the trainability of the deep feedforward network on a range of datasets (MNIST, CIFAR10, FashionMNIST, and white noise), thereby providing a significant reduction in overall training time. We achieve this by computing the relative entropy between reconstructed images and the original inputs, and show that this probe of information loss is sensitive to the phase behaviour of the network. We further demonstrate that this method generalizes to residual neural networks (ResNets) and convolutional neural networks (CNNs). Moreover, our method illustrates the network's decision making process by displaying the changes performed on the input data at each layer, which we demonstrate for both a DNN trained on MNIST and the vgg16 CNN trained on the ImageNet dataset. Our results provide a technique for significantly accelerating the training of large neural networks.

Yanick Thurn, Ro Jefferson, Johanna Erdmenger | 12/23/2024
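The abstract above uses the relative entropy between reconstructed images and the original inputs as a probe of information loss. A minimal sketch of such a measure follows, assuming non-negative pixel values and assuming that each image is normalised to sum to one so it can be treated as a probability distribution; the authors' exact normalisation may differ.

```python
import numpy as np

def relative_entropy(original, reconstructed, eps=1e-12):
    """KL(original || reconstructed) after normalising each image to sum to 1.

    Assumes non-negative pixel values; `eps` guards against log(0).
    """
    p = np.asarray(original, dtype=float).ravel() + eps
    q = np.asarray(reconstructed, dtype=float).ravel() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
image = rng.random((28, 28))                                   # stand-in for an input image
noisy = np.clip(image + 0.1 * rng.standard_normal((28, 28)), 0.0, None)
kl_same = relative_entropy(image, image)
kl_noisy = relative_entropy(image, noisy)
```

A perfect reconstruction gives zero relative entropy, and larger values indicate that more of the input has been lost by the layer being probed.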