Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dan Gutfreund

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

May 29, 2025

Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong(+1 more)

Figure 1 for Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Figure 2 for Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Figure 3 for Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Figure 4 for Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Abstract:Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeatedly sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.

Via

Access Paper or Ask Questions

LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

May 31, 2024

Amin Heyrani Nobari, Akash Srivastava, Dan Gutfreund, Kai Xu, Faez Ahmed

Figure 1 for LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

Figure 2 for LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

Figure 3 for LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

Figure 4 for LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

Abstract:In this paper, we introduce LInK, a novel framework that integrates contrastive learning of performance and design space with optimization techniques for solving complex inverse problems in engineering design with discrete and continuous variables. We focus on the path synthesis problem for planar linkage mechanisms. By leveraging a multi-modal and transformation-invariant contrastive learning framework, LInK learns a joint representation that captures complex physics and design representations of mechanisms, enabling rapid retrieval from a vast dataset of over 10 million mechanisms. This approach improves precision through the warm start of a hierarchical unconstrained nonlinear optimization algorithm, combining the robustness of traditional optimization with the speed and adaptability of modern deep learning methods. Our results on an existing benchmark demonstrate that LInK outperforms existing methods with 28 times less error compared to a state-of-the-art approach while taking 20 times less time on an existing benchmark. Moreover, we introduce a significantly more challenging benchmark, named LINK-ABC, which involves synthesizing linkages that trace the trajectories of English capital alphabets - an inverse design benchmark task that existing methods struggle with due to large non-linearities and tiny feasible space. Our results demonstrate that LInK not only advances the field of mechanism design but also broadens the applicability of contrastive learning and optimization to other areas of engineering.

Via

Access Paper or Ask Questions

Learning from Invalid Data: On Constraint Satisfaction in Generative Models

Jun 27, 2023

Giorgio Giannone, Lyle Regenwetter, Akash Srivastava, Dan Gutfreund, Faez Ahmed

Abstract:Generative models have demonstrated impressive results in vision, language, and speech. However, even with massive datasets, they struggle with precision, generating physically invalid or factually incorrect data. This is particularly problematic when the generated data must satisfy constraints, for example, to meet product specifications in engineering design or to adhere to the laws of physics in a natural scene. To improve precision while preserving diversity and fidelity, we propose a novel training mechanism that leverages datasets of constraint-violating data points, which we consider invalid. Our approach minimizes the divergence between the generative distribution and the valid prior while maximizing the divergence with the invalid distribution. We demonstrate how generative models like GANs and DDPMs that we augment to train with invalid data vastly outperform their standard counterparts which solely train on valid data points. For example, our training procedure generates up to 98 % fewer invalid samples on 2D densities, improves connectivity and stability four-fold on a stacking block problem, and improves constraint satisfaction by 15 % on a structural topology optimization benchmark in engineering design. We also analyze how the quality of the invalid data affects the learning procedure and the generalization properties of models. Finally, we demonstrate significant improvements in sample efficiency, showing that a tenfold increase in valid samples leads to a negligible difference in constraint satisfaction, while less than 10 % invalid samples lead to a tenfold improvement. Our proposed mechanism offers a promising solution for improving precision in generative models while preserving diversity and fidelity, particularly in domains where constraint satisfaction is critical and data is limited, such as engineering design, robotics, and medicine.

Via

Access Paper or Ask Questions

Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Feb 11, 2023

Lyle Regenwetter, Akash Srivastava, Dan Gutfreund, Faez Ahmed

Figure 1 for Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Figure 2 for Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Figure 3 for Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Figure 4 for Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Abstract:Deep generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, and Transformers, have shown great promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery. However, when applied to engineering design problems, evaluating the performance of these models can be challenging, as traditional statistical metrics based on likelihood may not fully capture the requirements of engineering applications. This paper doubles as a review and a practical guide to evaluation metrics for deep generative models (DGMs) in engineering design. We first summarize well-accepted `classic' evaluation metrics for deep generative models grounded in machine learning theory and typical computer science applications. Using case studies, we then highlight why these metrics seldom translate well to design problems but see frequent use due to the lack of established alternatives. Next, we curate a set of design-specific metrics which have been proposed across different research communities and can be used for evaluating deep generative models. These metrics focus on unique requirements in design and engineering, such as constraint satisfaction, functional performance, novelty, and conditioning. We structure our review and discussion as a set of practical selection criteria and usage guidelines. Throughout our discussion, we apply the metrics to models trained on simple 2-dimensional example problems. Finally, to illustrate the selection process and classic usage of the presented metrics, we evaluate three deep generative models on a multifaceted bicycle frame design problem considering performance target achievement, design novelty, and geometric constraints. We publicly release the code for the datasets, models, and metrics used throughout the paper at decode.mit.edu/projects/metrics/.

Via

Access Paper or Ask Questions

3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics

Feb 07, 2023

Guangyao Zhou, Nishad Gothoskar, Lirui Wang, Joshua B. Tenenbaum, Dan Gutfreund, Miguel Lázaro-Gredilla, Dileep George, Vikash K. Mansinghka

Figure 1 for 3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics

Figure 2 for 3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics

Figure 3 for 3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics

Figure 4 for 3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics

Abstract:A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between 3D graphics and real-world data. We propose a novel 3D Neural Embedding Likelihood (3DNEL) over RGB-D images to address this gap. 3DNEL uses neural embeddings to predict 2D-3D correspondences from RGB and combines this with depth in a principled manner. 3DNEL is trained entirely from synthetic images and generalizes to real-world data. To showcase this capability, we develop a multi-stage inverse graphics pipeline that uses 3DNEL for 6D object pose estimation from real RGB-D images. Our method outperforms the previous state-of-the-art in sim-to-real pose estimation on the YCB-Video dataset, and improves robustness, with significantly fewer large-error predictions. Unlike existing bottom-up, discriminative approaches that are specialized for pose estimation, 3DNEL adopts a probabilistic generative formulation that jointly models multi-object scenes. This generative formulation enables easy extension of 3DNEL to additional tasks like object and camera tracking from video, using principled inference in the same probabilistic model without task specific retraining.

Via

Access Paper or Ask Questions

LINKS: A dataset of a hundred million planar linkage mechanisms for data-driven kinematic design

Aug 30, 2022

Amin Heyrani Nobari, Akash Srivastava, Dan Gutfreund, Faez Ahmed

Figure 1 for LINKS: A dataset of a hundred million planar linkage mechanisms for data-driven kinematic design

Figure 2 for LINKS: A dataset of a hundred million planar linkage mechanisms for data-driven kinematic design

Figure 3 for LINKS: A dataset of a hundred million planar linkage mechanisms for data-driven kinematic design

Figure 4 for LINKS: A dataset of a hundred million planar linkage mechanisms for data-driven kinematic design

Abstract:In this paper, we introduce LINKS, a dataset of 100 million one degree of freedom planar linkage mechanisms and 1.1 billion coupler curves, which is more than 1000 times larger than any existing database of planar mechanisms and is not limited to specific kinds of mechanisms such as four-bars, six-bars, \etc which are typically what most databases include. LINKS is made up of various components including 100 million mechanisms, the simulation data for each mechanism, normalized paths generated by each mechanism, a curated set of paths, the code used to generate the data and simulate mechanisms, and a live web demo for interactive design of linkage mechanisms. The curated paths are provided as a measure for removing biases in the paths generated by mechanisms that enable a more even design space representation. In this paper, we discuss the details of how we can generate such a large dataset and how we can overcome major issues with such scales. To be able to generate such a large dataset we introduce a new operator to generate 1-DOF mechanism topologies, furthermore, we take many steps to speed up slow simulations of mechanisms by vectorizing our simulations and parallelizing our simulator on a large number of threads, which leads to a simulation 800 times faster than the simple simulation algorithm. This is necessary given on average, 1 out of 500 candidates that are generated are valid~(and all must be simulated to determine their validity), which means billions of simulations must be performed for the generation of this dataset. Then we demonstrate the depth of our dataset through a bi-directional chamfer distance-based shape retrieval study where we show how our dataset can be used directly to find mechanisms that can trace paths very close to desired target paths.

* Code & Data: https://github.com/ahnobari/LINKS

Via

Access Paper or Ask Questions

Solving the Baby Intuitions Benchmark with a Hierarchically Bayesian Theory of Mind

Aug 04, 2022

Tan Zhi-Xuan, Nishad Gothoskar, Falk Pollok, Dan Gutfreund, Joshua B. Tenenbaum, Vikash K. Mansinghka

Figure 1 for Solving the Baby Intuitions Benchmark with a Hierarchically Bayesian Theory of Mind

Figure 2 for Solving the Baby Intuitions Benchmark with a Hierarchically Bayesian Theory of Mind

Figure 3 for Solving the Baby Intuitions Benchmark with a Hierarchically Bayesian Theory of Mind

Abstract:To facilitate the development of new models to bridge the gap between machine and human social intelligence, the recently proposed Baby Intuitions Benchmark (arXiv:2102.11938) provides a suite of tasks designed to evaluate commonsense reasoning about agents' goals and actions that even young infants exhibit. Here we present a principled Bayesian solution to this benchmark, based on a hierarchically Bayesian Theory of Mind (HBToM). By including hierarchical priors on agent goals and dispositions, inference over our HBToM model enables few-shot learning of the efficiency and preferences of an agent, which can then be used in commonsense plausibility judgements about subsequent agent behavior. This approach achieves near-perfect accuracy on most benchmark tasks, outperforming deep learning and imitation learning baselines while producing interpretable human-like inferences, demonstrating the advantages of structured Bayesian models of human social cognition.

* 6 pages, 2 figures. Presented at the Robotics: Science and Systems 2022 Workshop on Social Intelligence in Humans and Robots

Via

Access Paper or Ask Questions

Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Jul 07, 2022

Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba

Figure 1 for Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Figure 2 for Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Figure 3 for Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Figure 4 for Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Abstract:The way an object looks and sounds provide complementary reflections of its physical properties. In many settings cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped -- and where -- by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we have generated a large-scale dataset -- the Fallen Objects dataset -- that includes 8000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting. As a first step toward addressing this challenge, we develop a set of embodied agent baselines, based on imitation learning, reinforcement learning, and modular planning, and perform an in-depth analysis of the challenge of this new task.

* CVPR 2022. Project page: http://fallen-object.csail.mit.edu

Via

Access Paper or Ask Questions

3DP3: 3D Scene Perception via Probabilistic Programming

Oct 30, 2021

Nishad Gothoskar, Marco Cusumano-Towner, Ben Zinberg, Matin Ghavamizadeh, Falk Pollok, Austin Garrett, Joshua B. Tenenbaum, Dan Gutfreund, Vikash K. Mansinghka

Figure 1 for 3DP3: 3D Scene Perception via Probabilistic Programming

Figure 2 for 3DP3: 3D Scene Perception via Probabilistic Programming

Figure 3 for 3DP3: 3D Scene Perception via Probabilistic Programming

Figure 4 for 3DP3: 3D Scene Perception via Probabilistic Programming

Abstract:We present 3DP3, a framework for inverse graphics that uses inference in a structured generative model of objects, scenes, and images. 3DP3 uses (i) voxel models to represent the 3D shape of objects, (ii) hierarchical scene graphs to decompose scenes into objects and the contacts between them, and (iii) depth image likelihoods based on real-time graphics. Given an observed RGB-D image, 3DP3's inference algorithm infers the underlying latent 3D scene, including the object poses and a parsimonious joint parametrization of these poses, using fast bottom-up pose proposals, novel involutive MCMC updates of the scene graph structure, and, optionally, neural object detectors and pose estimators. We show that 3DP3 enables scene understanding that is aware of 3D shape, occlusion, and contact structure. Our results demonstrate that 3DP3 is more accurate at 6DoF object pose estimation from real images than deep learning baselines and shows better generalization to challenging scenes with novel viewpoints, contact, and partial observability.

* NeurIPS 2021

Via

Access Paper or Ask Questions

Incorporating Rich Social Interactions Into MDPs

Oct 22, 2021

Ravi Tejwani, Yen-Ling Kuo, Tianmin Shu, Bennett Stankovits, Dan Gutfreund, Joshua B. Tenenbaum, Boris Katz, Andrei Barbu

Figure 1 for Incorporating Rich Social Interactions Into MDPs

Figure 2 for Incorporating Rich Social Interactions Into MDPs

Figure 3 for Incorporating Rich Social Interactions Into MDPs

Figure 4 for Incorporating Rich Social Interactions Into MDPs

Abstract:Much of what we do as humans is engage socially with other agents, a skill that robots must also eventually possess. We demonstrate that a rich theory of social interactions originating from microsociology and economics can be formalized by extending a nested MDP where agents reason about arbitrary functions of each other's hidden rewards. This extended Social MDP allows us to encode the five basic interactions that underlie microsociology: cooperation, conflict, coercion, competition, and exchange. The result is a robotic agent capable of executing social interactions zero-shot in new environments; like humans it can engage socially in novel ways even without a single example of that social interaction. Moreover, the judgments of these Social MDPs align closely with those of humans when considering which social interaction is taking place in an environment. This method both sheds light on the nature of social interactions, by providing concrete mathematical definitions, and brings rich social interactions into a mathematical framework that has proven to be natural for robotics, MDPs.

* Submitted to the 39th IEEE Conference on Robotics and Automation (ICRA 2022). Do not distribute

Via

Access Paper or Ask Questions