cond-mat.dis-nn
9 postsarXiv:2309.04522v2 Announce Type: replace Abstract: Artificial neural networks have revolutionized machine learning in recent years, but a complete theoretical framework for their learning process is still lacking. Substantial advances were achieved for wide networks, within two disparate theoretical frameworks: the Neural Tangent Kernel (NTK), which assumes linearized gradient descent dynamics, and the Bayesian Neural Network Gaussian Process (NNGP). We unify these two theories using gradient descent learning with an additional noise in an ensemble of wide deep networks. We construct an analytical theory for the network input-output function and introduce a new time-dependent Neural Dynamical Kernel (NDK) from which both NTK and NNGP kernels are derived. We identify two learning phases: a gradient-driven learning phase, dominated by loss minimization, in which the time scale is governed by the initialization variance. It is followed by a slow diffusive learning stage, where the parameters sample the solution space, with a time constant decided by the noise and the Bayesian prior variance. The two variance parameters strongly affect the performance in the two regimes, especially in sigmoidal neurons. In contrast to the exponential convergence of the mean predictor in the initial phase, the convergence to the equilibrium is more complex and may behave nonmonotonically. By characterizing the diffusive phase, our work sheds light on representational drift in the brain, explaining how neural activity changes continuously without degrading performance, either by ongoing gradient signals that synchronize the drifts of different synapses or by architectural biases that generate task-relevant information that is robust against the drift process. This work closes the gap between the NTK and NNGP theories, providing a comprehensive framework for the learning process of deep wide neural networks and for analyzing dynamics in biological circuits.
arXiv:2412.18290v1 Announce Type: cross Abstract: Quantum reservoir computing (QRC) has emerged as a promising paradigm for harnessing near-term quantum devices to tackle temporal machine learning tasks. Yet identifying the mechanisms that underlie enhanced performance remains challenging, particularly in many-body open systems where nonlinear interactions and dissipation intertwine in complex ways. Here, we investigate a minimal model of a driven-dissipative quantum reservoir described by two coupled Kerr-nonlinear oscillators, an experimentally realizable platform that features controllable coupling, intrinsic nonlinearity, and tunable photon loss. Using Partial Information Decomposition (PID), we examine how different dynamical regimes encode input drive signals in terms of redundancy (information shared by each oscillator) and synergy (information accessible only through their joint observation). Our key results show that, near a critical point marking a dynamical bifurcation, the system transitions from predominantly redundant to synergistic encoding. We further demonstrate that synergy amplifies short-term responsiveness, thereby enhancing immediate memory retention, whereas strong dissipation leads to more redundant encoding that supports long-term memory retention. These findings elucidate how the interplay of instability and dissipation shapes information processing in small quantum systems, providing a fine-grained, information-theoretic perspective for analyzing and designing QRC platforms.
arXiv:2402.16991v3 Announce Type: replace-cross Abstract: Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organized in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying compositional structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed, but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterizes the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.
arXiv:2412.01110v3 Announce Type: replace-cross Abstract: Statistical physics provides tools for analyzing high-dimensional problems in machine learning and theoretical neuroscience. These calculations, particularly those using the replica method, often involve lengthy derivations that can obscure physical interpretation. We give concise, non-replica derivations of several key results and highlight their underlying similarities. Specifically, we introduce a cavity approach to analyzing high-dimensional learning problems and apply it to three cases: perceptron classification of points, perceptron classification of manifolds, and kernel ridge regression. These problems share a common structure -- a bipartite system of interacting feature and datum variables -- enabling a unified analysis. For perceptron-capacity problems, we identify a symmetry that allows derivation of correct capacities through a na\"ive method. These results match those obtained through the replica method.
arXiv:2412.16411v1 Announce Type: new Abstract: We construct a thermodynamic potential that can guide training of a generative model defined on a set of binary degrees of freedom. We argue that upon reduction in description, so as to make the generative model computationally-manageable, the potential develops multiple minima. This is mirrored by the emergence of multiple minima in the free energy proper of the generative model itself. The variety of training samples that employ N binary degrees of freedom is ordinarily much lower than the size 2^N of the full phase space. The non-represented configurations, we argue, should be thought of as comprising a high-temperature phase separated by an extensive energy gap from the configurations composing the training set. Thus, training amounts to sampling a free energy surface in the form of a library of distinct bound states, each of which breaks ergodicity. The ergodicity breaking prevents escape into the near continuum of states comprising the high-temperature phase; thus it is necessary for proper functionality. It may however have the side effect of limiting access to patterns that were underrepresented in the training set. At the same time, the ergodicity breaking within the library complicates both learning and retrieval. As a remedy, one may concurrently employ multiple generative models -- up to one model per free energy minimum.
arXiv:2405.18296v2 Announce Type: replace Abstract: Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setting, which we prove to be exact in high dimension. Notably, our analysis reveals how different properties of sub-populations influence bias at different timescales, showing a shifting preference of the classifier during training. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias. We empirically validate our results in more complex scenarios by training deeper networks on synthetic and real datasets, including CIFAR10, MNIST, and CelebA.
arXiv:2412.16249v1 Announce Type: new Abstract: Behavioral experiments on the ultimatum game (UG) reveal that we humans prefer fair acts, which contradicts the prediction made in orthodox Economics. Existing explanations, however, are mostly attributed to exogenous factors within the imitation learning framework. Here, we adopt the reinforcement learning paradigm, where individuals make their moves aiming to maximize their accumulated rewards. Specifically, we apply Q-learning to UG, where each player is assigned two Q-tables to guide decisions for the roles of proposer and responder. In a two-player scenario, fairness emerges prominently when both experiences and future rewards are appreciated. In particular, the probability of successful deals increases with higher offers, which aligns with observations in behavioral experiments. Our mechanism analysis reveals that the system undergoes two phases, eventually stabilizing into fair or rational strategies. These results are robust when the rotating role assignment is replaced by a random or fixed manner, or the scenario is extended to a latticed population. Our findings thus conclude that the endogenous factor is sufficient to explain the emergence of fairness, exogenous factors are not needed.
arXiv:2406.12916v3 Announce Type: replace Abstract: An important challenge in machine learning is to predict the initial conditions under which a given neural network will be trainable. We present a method for predicting the trainable regime in parameter space for deep feedforward neural networks (DNNs) based on reconstructing the input from subsequent activation layers via a cascade of single-layer auxiliary networks. We show that a single epoch of training of the shallow cascade networks is sufficient to predict the trainability of the deep feedforward network on a range of datasets (MNIST, CIFAR10, FashionMNIST, and white noise), thereby providing a significant reduction in overall training time. We achieve this by computing the relative entropy between reconstructed images and the original inputs, and show that this probe of information loss is sensitive to the phase behaviour of the network. We further demonstrate that this method generalizes to residual neural networks (ResNets) and convolutional neural networks (CNNs). Moreover, our method illustrates the network's decision making process by displaying the changes performed on the input data at each layer, which we demonstrate for both a DNN trained on MNIST and the vgg16 CNN trained on the ImageNet dataset. Our results provide a technique for significantly accelerating the training of large neural networks.
arXiv:2412.15461v1 Announce Type: new Abstract: We study the exploration-exploitation trade-off for large multiplayer coordination games where players strategise via Q-Learning, a common learning framework in multi-agent reinforcement learning. Q-Learning is known to have two shortcomings, namely non-convergence and potential equilibrium selection problems, when there are multiple fixed points, called Quantal Response Equilibria (QRE). Furthermore, whilst QRE have full support for finite games, it is not clear how Q-Learning behaves as the game becomes large. In this paper, we characterise the critical exploration rate that guarantees convergence to a unique fixed point, addressing the two shortcomings above. Using a generating-functional method, we show that this rate increases with the number of players and the alignment of their payoffs. For many-player coordination games with perfectly aligned payoffs, this exploration rate is roughly twice that of $p$-player zero-sum games. As for large games, we provide a structural result for QRE, which suggests that as the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space, a phenomenon we term asymptotic extinction, where a constant fraction of the actions are played with zero probability at a rate $o(1/N)$ for an $N$-action game.