Abstract:Empirical power--law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution--Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse--grained dynamical description of training. Within GRSD, power--law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse--grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log--shift invariance of renormalized shell couplings. We further show that power--law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log--shift invariance is combined with the intrinsic time--rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power--law form.
Abstract:Neural scaling laws and double-descent phenomena suggest that deep-network training obeys a simple macroscopic structure despite highly nonlinear optimization dynamics. We derive such structure directly from gradient descent in function space. For mean-squared error loss, the training error evolves as $\dot e_t=-M(t)e_t$ with $M(t)=J_{θ(t)}J_{θ(t)}^{\!*}$, a time-dependent self-adjoint operator induced by the network Jacobian. Using Kato perturbation theory, we obtain an exact system of coupled modewise ODEs in the instantaneous eigenbasis of $M(t)$. To extract macroscopic behavior, we introduce a logarithmic spectral-shell coarse-graining and track quadratic error energy across shells. Microscopic interactions within each shell cancel identically at the energy level, so shell energies evolve only through dissipation and external inter-shell interactions. We formalize this via a \emph{renormalizable shell-dynamics} assumption, under which cumulative microscopic effects reduce to a controlled net flux across shell boundaries. Assuming an effective power-law spectral transport in a relevant resolution range, the shell dynamics admits a self-similar solution with a moving resolution frontier and explicit scaling exponents. This framework explains neural scaling laws and double descent, and unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics.
Abstract:Empirical scaling laws describe how test loss and other performance metrics depend on model size, dataset size, and compute. While such laws are consistent within specific regimes, apparently distinct scaling behaviors have been reported for related settings such as model compression. Motivated by recent progress in spectral analyses of neural representations, this paper develops a \emph{generalized spectral framework} that unifies learning dynamics and compression phenomena under a common functional ansatz. We generalize the spectral evolution function from the linear kernel form $g(λt)=λt$ to an asymptotically polynomial function $g(λ,t;β)$, characterized by an effective spectral--temporal elasticity $ρ(β)$. This framework recovers existing lazy and feature-learning theories as special cases and yields an invariant relation between learning and compression
Abstract:Self-supervised learning (SSL) on large-scale datasets like AudioSet has become the dominant paradigm for audio representation learning. While the continuous influx of new, unlabeled audio presents an opportunity to enrich these static representations, a naive approach is to retrain the model from scratch using all available data. However, this method is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR effectively adapts to new domains while mitigating catastrophic forgetting by tackling three key challenges: implementing a joint sampling strategy for new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook for novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.
Abstract:Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP ``for free''. In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.
Abstract:We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of its payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner's objectives or algorithm to successfully exploit them. Building on this intuition, we reduce the problem for the optimizer to that of recovering the learner's payoff structure. We demonstrate the effectiveness of this approach if the learner's algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.
Abstract:This paper explores the intricate behavior of deep neural networks (DNNs) through the lens of neuron activation dynamics. We propose a probabilistic framework that can analyze models' neuron activation patterns as a stochastic process, uncovering theoretical insights into neural scaling laws, such as over-parameterization and the power-law decay of loss with respect to dataset size. By deriving key mathematical relationships, we present that the number of activated neurons increases in the form of $N(1-(\frac{bN}{D+bN})^b)$, and the neuron activation should follows power-law distribution. Based on these two mathematical results, we demonstrate how DNNs maintain generalization capabilities even under over-parameterization, and we elucidate the phase transition phenomenon observed in loss curves as dataset size plotted in log-axis (i.e. the data magnitude increases linearly). Moreover, by combining the above two phenomenons and the power-law distribution of neuron activation, we derived the power-law decay of neural network's loss function as the data size scale increases. Furthermore, our analysis bridges the gap between empirical observations and theoretical underpinnings, offering experimentally testable predictions regarding parameter efficiency and model compressibility. These findings provide a foundation for understanding neural network scaling and present new directions for optimizing DNN performance.




Abstract:In recent decades, there has been substantial advances in time series models and benchmarks across various individual tasks, such as time series forecasting, classification, and anomaly detection. Meanwhile, compositional reasoning in time series is prevalent in real-world applications (e.g., decision-making and compositional question answering) and is in great demand. Unlike simple tasks that primarily focus on predictive accuracy, compositional reasoning emphasizes the synthesis of diverse information from both time series data and various domain knowledge, making it distinct and extremely more challenging. In this paper, we introduce Compositional Time Series Reasoning, a new task of handling intricate multistep reasoning tasks from time series data. Specifically, this new task focuses on various question instances requiring structural and compositional reasoning abilities on time series data, such as decision-making and compositional question answering. As an initial attempt to tackle this novel task, we developed TS-Reasoner, a program-aided approach that utilizes large language model (LLM) to decompose a complex task into steps of programs that leverage existing time series models and numerical subroutines. Unlike existing reasoning work which only calls off-the-shelf modules, TS-Reasoner allows for the creation of custom modules and provides greater flexibility to incorporate domain knowledge as well as user-specified constraints. We demonstrate the effectiveness of our method through a comprehensive set of experiments. These promising results indicate potential opportunities in the new task of time series reasoning and highlight the need for further research.




Abstract:With recent advances in building foundation models for texts and video data, there is a surge of interest in foundation models for time series. A family of models have been developed, utilizing a temporal auto-regressive generative Transformer architecture, whose effectiveness has been proven in Large Language Models. While the empirical results are promising, almost all existing time series foundation models have only been tested on well-curated ``benchmark'' datasets very similar to texts. However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the uni-directional nature of temporally auto-regressive decoding limits the incorporation of domain knowledge, such as physical laws expressed as partial differential equations (PDEs). To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that employs a denoising diffusion paradigm instead of temporal auto-regressive generation. TimeDiT leverages the Transformer architecture to capture temporal dependencies and employs diffusion processes to generate high-quality candidate samples without imposing stringent assumptions on the target distribution via novel masking schemes and a channel alignment strategy. Furthermore, we propose a finetuning-free model editing strategy that allows the seamless integration of external knowledge during the sampling process without updating any model parameters. Extensive experiments conducted on a varity of tasks such as forecasting, imputation, and anomaly detection, demonstrate the effectiveness of TimeDiT.




Abstract:Foundation models, such as Large language Models (LLMs), have attracted significant amount of interest due to their large number of applications. Existing works show that appropriate prompt design, such as Chain-of-Thoughts, can unlock LLM's powerful capacity in diverse areas. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, existing prompting strategies either suffers from insufficient expressive power or intermediate errors triggered by hallucination. To make LLM more discerning to such intermediate errors, we propose to guide LLM with a Divide-and-Conquer program that simultaneously ensures superior expressive power and disentangles task decomposition, sub-task resolution, and resolution assembly process. Theoretic analysis reveals that our strategy can guide LLM to extend the expressive power of fixed-depth Transformer. Experiments indicate that our proposed method can achieve better performance than typical prompting strategies in tasks bothered by intermediate errors and deceptive contents, such as large integer multiplication, hallucination detection and misinformation detection.