Abstract:Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood. To investigate these behaviors, arithmetic tasks serve as important venues. In previous studies, seemingly unrelated mysteries still exist -- (1) models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition, but their effectiveness varies in more complex tasks like multiplication; (2) models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101), regardless of the positional encoding used. We believe previous studies have been treating the symptoms rather than addressing the root cause -- they have paid excessive attention to improving model components, while overlooking the differences in task properties that may be the real drivers. This is confirmed by our unified theoretical framework for different arithmetic scenarios. For example, unlike multiplication, the digital addition task has the property of translation invariance which naturally aligns with the relative positional encoding, and this combination leads to successful generalization of addition to unseen longer domains. The discrepancy in operations modulo 100 and 101 arises from the base. Modulo 100, unlike 101, is compatible with the decimal system (base 10), such that unseen information in digits beyond the units digit and the tens digit is actually not needed for the task. Extensive experiments with GPT-like models validate our theoretical predictions. These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
Abstract:We propose a deep learning framework, DL-opt, designed to efficiently solve for optimal policies in quantifiable general equilibrium trade models. DL-opt integrates (i) a nested fixed point (NFXP) formulation of the optimization problem, (ii) automatic implicit differentiation to enhance gradient descent for solving unilateral optimal policies, and (iii) a best-response dynamics approach for finding Nash equilibria. Utilizing DL-opt, we solve for non-cooperative tariffs and industrial subsidies across 7 economies and 44 sectors, incorporating sectoral external economies of scale. Our quantitative analysis reveals significant sectoral heterogeneity in Nash policies: Nash industrial subsidies increase with scale elasticities, whereas Nash tariffs decrease with trade elasticities. Moreover, we show that global dual competition, involving both tariffs and industrial subsidies, results in lower tariffs and higher welfare outcomes compared to a global tariff war. These findings highlight the importance of considering sectoral heterogeneity and policy combinations in understanding global economic competition.
Abstract:This paper aims to explore the application of machine learning in forecasting Chinese macroeconomic variables. Specifically, it employs various machine learning models to predict the quarterly real GDP growth of China, and analyzes the factors contributing to the performance differences among these models. Our findings indicate that the average forecast errors of machine learning models are generally lower than those of traditional econometric models or expert forecasts, particularly in periods of economic stability. However, during certain inflection points, although machine learning models still outperform traditional econometric models, expert forecasts may exhibit greater accuracy in some instances due to experts' more comprehensive understanding of the macroeconomic environment and real-time economic variables. In addition to macroeconomic forecasting, this paper employs interpretable machine learning methods to identify the key attributive variables from different machine learning models, aiming to enhance the understanding and evaluation of their contributions to macroeconomic fluctuations.
Abstract:This paper explores the potential impacts of large language models (LLMs) on the Chinese labor market. We analyze occupational exposure to LLM capabilities by incorporating human expertise and LLM classifications, following Eloundou et al. (2023)'s methodology. We then aggregate occupation exposure to the industry level to obtain industry exposure scores. The results indicate a positive correlation between occupation exposure and wage levels/experience premiums, suggesting higher-paying and experience-intensive jobs may face greater displacement risks from LLM-powered software. The industry exposure scores align with expert assessments and economic intuitions. We also develop an economic growth model incorporating industry exposure to quantify the productivity-employment trade-off from AI adoption. Overall, this study provides an analytical basis for understanding the labor market impacts of increasingly capable AI systems in China. Key innovations include the occupation-level exposure analysis, industry aggregation approach, and economic modeling incorporating AI adoption and labor market effects. The findings will inform policymakers and businesses on strategies for maximizing the benefits of AI while mitigating adverse disruption risks.
Abstract:Generative Transformer-based models have achieved remarkable proficiency on solving diverse problems. However, their generalization ability is not fully understood and not always satisfying. Researchers take basic mathematical tasks like n-digit addition or multiplication as important perspectives for investigating their generalization behaviors. Curiously, it is observed that when training on n-digit operations (e.g., additions) in which both input operands are n-digit in length, models generalize successfully on unseen n-digit inputs (in-distribution (ID) generalization), but fail miserably and mysteriously on longer, unseen cases (out-of-distribution (OOD) generalization). Studies try to bridge this gap with workarounds such as modifying position embedding, fine-tuning, and priming with more extensive or instructive data. However, without addressing the essential mechanism, there is hardly any guarantee regarding the robustness of these solutions. We bring this unexplained performance drop into attention and ask whether it is purely from random errors. Here we turn to the mechanistic line of research which has notable successes in model interpretability. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfying OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with equivalence relations in the ID domain. These highlight the potential of the models to carry useful information for improved generalization.
Abstract:This paper proposes a novel deep generative model, called BSDE-Gen, which combines the flexibility of backward stochastic differential equations (BSDEs) with the power of deep neural networks for generating high-dimensional complex target data, particularly in the field of image generation. The incorporation of stochasticity and uncertainty in the generative modeling process makes BSDE-Gen an effective and natural approach for generating high-dimensional data. The paper provides a theoretical framework for BSDE-Gen, describes its model architecture, presents the maximum mean discrepancy (MMD) loss function used for training, and reports experimental results.
Abstract:We introduce a new model based on sets of probabilities for sequential data. We name the model GAMMT, which stands for Generative Ambiguity Models using Multiple Transformers. We suppose that data generating process of a sequence is ambiguous and determined by a set of probabilities rather than one as in the conventional model. We use multiple parallel transformers connected by a selection mechanism to approximate ambiguous probabilities. The GAMMT allows for ambiguity modeling in a generative way and multiple representations of the input tokens and the input sequence. This work explores the combination of attention mechanism and ambiguity by deep neural networks. We expect that this framework will facilitate new research into machine learning, improving our understanding of the attention-ambiguity mechanism.