Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Paul Smolensky

TRA: Better Length Generalisation with Threshold Relative Attention

Apr 02, 2025

Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao

Abstract:Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position, even if the dot product between a key and query is highly negative (i.e. an irrelevant key) learned positional biases may unintentionally up-weight such information - dangerous when distances become out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity - completely removing irrelevant keys from the attention softmax and b) contextualised relative distance - distance is only considered as between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve generalisation capabilities of decoder only transformers.

Via

Access Paper or Ask Questions

Compositional Generalization Across Distributional Shifts with Sparse Tree Operations

Dec 18, 2024

Paul Soulos, Henry Conklin, Mattia Opper, Paul Smolensky, Jianfeng Gao, Roland Fernandez

Abstract:Neural networks continue to struggle with compositional generalization, and this issue is exacerbated by a lack of massive pre-training. One successful approach for developing neural systems which exhibit human-like compositional generalization is \textit{hybrid} neurosymbolic techniques. However, these techniques run into the core issues that plague symbolic approaches to AI: scalability and flexibility. The reason for this failure is that at their core, hybrid neurosymbolic models perform symbolic computation and relegate the scalable and flexible neural computation to parameterizing a symbolic system. We investigate a \textit{unified} neurosymbolic system where transformations in the network can be interpreted simultaneously as both symbolic and neural computation. We extend a unified neurosymbolic architecture called the Differentiable Tree Machine in two central ways. First, we significantly increase the model's efficiency through the use of sparse vector representations of symbolic structures. Second, we enable its application beyond the restricted set of tree2tree problems to the more general class of seq2seq problems. The improved model retains its prior generalization capabilities and, since there is a fully neural path through the network, avoids the pitfalls of other neurosymbolic techniques that elevate symbolic computation over neural computation.

* NeurIPS 2024. Code available at https://github.com/psoulos/sdtm

Via

Access Paper or Ask Questions

Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks

Oct 23, 2024

Paul Smolensky, Roland Fernandez, Zhenghao Herbert Zhou, Mattia Opper, Jianfeng Gao

Abstract:Large Language Models (LLMs) have demonstrated impressive abilities in symbol processing through in-context learning (ICL). This success flies in the face of decades of predictions that artificial neural networks cannot master abstract symbol manipulation. We seek to understand the mechanisms that can enable robust symbol processing in transformer networks, illuminating both the unanticipated success, and the significant limitations, of transformers in symbol processing. Borrowing insights from symbolic AI on the power of Production System architectures, we develop a high-level language, PSL, that allows us to write symbolic programs to do complex, abstract symbol processing, and create compilers that precisely implement PSL programs in transformer networks which are, by construction, 100% mechanistically interpretable. We demonstrate that PSL is Turing Universal, so the work can inform the understanding of transformer ICL in general. The type of transformer architecture that we compile from PSL programs suggests a number of paths for enhancing transformers' capabilities at symbol processing. (Note: The first section of the paper gives an extended synopsis of the entire paper.)

* 101 pages (including 30 pages of Appendices), 18 figures

Via

Access Paper or Ask Questions

Implicit Chain of Thought Reasoning via Knowledge Distillation

Nov 02, 2023

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber

Abstract:To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.

Via

Access Paper or Ask Questions

Differentiable Tree Operations Promote Compositional Generalization

Jun 01, 2023

Paul Soulos, Edward Hu, Kate McCurdy, Yunmo Chen, Roland Fernandez, Paul Smolensky, Jianfeng Gao

Abstract:In the context of structure-to-structure transformation tasks, learning sequences of discrete symbolic operations poses significant challenges due to their non-differentiability. To facilitate the learning of these symbolic sequences, we introduce a differentiable tree interpreter that compiles high-level symbolic tree operations into subsymbolic matrix operations on tensors. We present a novel Differentiable Tree Machine (DTM) architecture that integrates our interpreter with an external memory and an agent that learns to sequentially select tree operations to execute the target transformation in an end-to-end manner. With respect to out-of-distribution compositional generalization on synthetic semantic parsing and language generation tasks, DTM achieves 100% while existing baselines such as Transformer, Tree Transformer, LSTM, and Tree2Tree LSTM achieve less than 30%. DTM remains highly interpretable in addition to its perfect performance.

* ICML 2023. Code available at https://github.com/psoulos/dtm

Via

Access Paper or Ask Questions

Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Dec 21, 2022

Najoung Kim, Tal Linzen, Paul Smolensky

Figure 1 for Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Figure 2 for Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Figure 3 for Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Figure 4 for Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Abstract:Human linguistic capacity is often characterized by compositionality and the generalization it enables -- human learners can produce and comprehend novel complex expressions by composing known parts. Several benchmarks exploit distributional control across training and test to gauge compositional generalization, where certain lexical items only occur in limited contexts during training. While recent work using these benchmarks suggests that pretrained models achieve impressive generalization performance, we argue that exposure to pretraining data may break the aforementioned distributional control. Using the COGS benchmark of Kim and Linzen (2020), we test two modified evaluation setups that control for this issue: (1) substituting context-controlled lexical items with novel character sequences, and (2) substituting them with special tokens represented by novel embeddings. We find that both of these setups lead to lower generalization performance in T5 (Raffel et al., 2020), suggesting that previously reported results have been overestimated due to uncontrolled lexical exposure during pretraining. The performance degradation is more extreme with novel embeddings, and the degradation increases with the amount of pretraining data, highlighting an interesting case of inverse scaling.

* Preprint

Via

Access Paper or Ask Questions

Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Aug 11, 2022

Paul Soulos, Sudha Rao, Caitlin Smith, Eric Rosen, Asli Celikyilmaz, R. Thomas McCoy, Yichen Jiang, Coleman Haley, Roland Fernandez, Hamid Palangi(+2 more)

Figure 1 for Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Figure 2 for Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Figure 3 for Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Figure 4 for Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Abstract:Machine translation has seen rapid progress with the advent of Transformer-based models. These models have no explicit linguistic structure built into them, yet they may still implicitly learn structured relationships by attending to relevant tokens. We hypothesize that this structural learning could be made more robust by explicitly endowing Transformers with a structural bias, and we investigate two methods for building in such a bias. One method, the TP-Transformer, augments the traditional Transformer architecture to include an additional component to represent structure. The second method imbues structure at the data level by segmenting the data with morphological tokenization. We test these methods on translating from English into morphologically rich languages, Turkish and Inuktitut, and consider both automatic metrics and human evaluations. We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset. In sum, structural encoding methods make Transformers more sample-efficient, enabling them to perform better from smaller amounts of data.

* Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
* Revised edition to 4th Workshop on Technologies for MT of Low Resource Languages

Via

Access Paper or Ask Questions

Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems

May 02, 2022

Paul Smolensky, R. Thomas McCoy, Roland Fernandez, Matthew Goldrick, Jianfeng Gao

Figure 1 for Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems

Figure 2 for Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems

Figure 3 for Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems

Figure 4 for Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems

Abstract:What explains the dramatic progress from 20th-century to 21st-century AI, and how can the remaining limitations of current AI be overcome? The widely accepted narrative attributes this progress to massive increases in the quantity of computational and data resources available to support statistical learning in deep artificial neural networks. We show that an additional crucial factor is the development of a new type of computation. Neurocompositional computing adopts two principles that must be simultaneously respected to enable human-level cognition: the principles of Compositionality and Continuity. These have seemed irreconcilable until the recent mathematical discovery that compositionality can be realized not only through discrete methods of symbolic computing, but also through novel forms of continuous neural computing. The revolutionary recent progress in AI has resulted from the use of limited forms of neurocompositional computing. New, deeper forms of neurocompositional computing create AI systems that are more robust, accurate, and comprehensible.

* 21 pages, 6 figures. For a general AI audience: to appear in AI Magazine. A more extensive presentation of this work is "Neurocompositional computing in human and machine intelligence: A tutorial", Microsoft Technical Report MSR-TR-2022-5; see https://www.microsoft.com/en-us/research/publication/neurocompositional-computing-in-human-and-machine-intelligence-a-tutorial/

Via

Access Paper or Ask Questions

How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

Nov 18, 2021

R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Asli Celikyilmaz

Figure 1 for How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

Figure 2 for How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

Figure 3 for How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

Figure 4 for How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

Abstract:Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure - e.g., individual dependencies - model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure - e.g., overall sentence structure - model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).

* 10 pages, plus 39 pages of appendices

Via

Access Paper or Ask Questions

Distributed neural encoding of binding to thematic roles

Oct 24, 2021

Matthias Lalisse, Paul Smolensky

Figure 1 for Distributed neural encoding of binding to thematic roles

Figure 2 for Distributed neural encoding of binding to thematic roles

Figure 3 for Distributed neural encoding of binding to thematic roles

Figure 4 for Distributed neural encoding of binding to thematic roles

Abstract:A framework and method are proposed for the study of constituent composition in fMRI. The method produces estimates of neural patterns encoding complex linguistic structures, under the assumption that the contributions of individual constituents are additive. Like usual techniques for modeling compositional structure in fMRI, the proposed method employs pattern superposition to synthesize complex structures from their parts. Unlike these techniques, superpositions are sensitive to the structural positions of constituents, making them irreducible to structure-indiscriminate ("bag-of-words") models of composition. Reanalyzing data from a study by Frankland and Greene (2015), it is shown that comparison of neural predictive models with differing specifications can illuminate aspects of neural representational contents that are not apparent when composition is not modelled. The results indicate that the neural instantiations of the binding of fillers to thematic roles in a sentence are non-orthogonal, and therefore spatially overlapping.

* Originally presented as a poster MACSIM 8 (2019)

Via

Access Paper or Ask Questions