Michael Yang

Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

May 25, 2025

Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo

Abstract: Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works started to train specialized patching models. Most works trained one model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As such, even with a 70-billion-parameter model, SOTA methods can only reach up to a 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technical novelties are the specific task designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves a 46% resolved rate on SWE-bench-Verified with only 3 x 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data number, model size, and testing-phase scaling strategy.
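
The majority vote-based patch selection mentioned in the abstract could look roughly like the sketch below. It is an illustration only, not the authors' implementation: the function name `select_patch`, the shape of the `verdicts` input, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def select_patch(verdicts):
    """Pick the candidate patch that received the most positive votes.

    `verdicts` maps a (hypothetical) patch id to a list of booleans, one per
    validation signal, e.g. a reproducing test or a model judgment.
    Ties go to the patch listed first (an assumption for this sketch).
    """
    if not verdicts:
        raise ValueError("no candidate patches")
    votes = Counter()
    for patch_id, results in verdicts.items():
        votes[patch_id] = sum(1 for passed in results if passed)
    best = max(votes.values())
    for patch_id in verdicts:  # dicts preserve insertion order
        if votes[patch_id] == best:
            return patch_id


candidates = {
    "patch_a": [True, False, True],
    "patch_b": [True, True, True],
    "patch_c": [False, False, True],
}
print(select_patch(candidates))
```

Here `patch_b` wins because all three of its validation signals are positive.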

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

May 20, 2025

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu

Abstract: Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.

Contact Complexity in Customer Service

Feb 24, 2024

Shu-Ting Pi, Michael Yang, Qun Liu

Abstract: Customers who reach out for customer service support may face a range of issues that vary in complexity. Routing high-complexity contacts to junior agents can lead to multiple transfers or repeated contacts, while directing low-complexity contacts to senior agents can strain their capacity to assist customers who need professional help. To tackle this, a machine learning model that accurately predicts the complexity of customer issues is highly desirable. However, defining the complexity of a contact is a difficult task as it is a highly abstract concept. While consensus-based data annotation by experienced agents is a possible solution, it is time-consuming and costly. To overcome these challenges, we have developed a novel machine learning approach to define contact complexity. Instead of relying on human annotation, we trained an AI expert model to mimic the behavior of agents and evaluate each contact's complexity based on how the AI expert responds. If the AI expert is uncertain or lacks the skills to comprehend the contact transcript, it is considered a high-complexity contact. Our method has proven to be reliable, scalable, and cost-effective based on the collected data.

* Accepted at the KDD 2023 Workshop on Decision Intelligence and Analytics for Online Marketplaces
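
The abstract of "Contact Complexity in Customer Service" treats a contact as high-complexity when the AI expert is uncertain about it. One common way to quantify such uncertainty is the Shannon entropy of the model's predicted distribution. The abstract does not say which measure the authors use, so the entropy criterion, the function names, and the threshold below are assumptions for illustration.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_high_complexity(probs, threshold=0.5):
    """Flag a contact as high-complexity when the expert model is uncertain.

    `probs` is the (hypothetical) expert model's distribution over possible
    responses; `threshold` is an illustrative value, not one from the paper.
    """
    return prediction_entropy(probs) > threshold


confident = [0.98, 0.01, 0.01]  # expert strongly prefers one response
uncertain = [0.40, 0.30, 0.30]  # expert is spread across responses
print(is_high_complexity(confident), is_high_complexity(uncertain))
```

A peaked distribution yields low entropy and is treated as a simple contact, while a spread-out distribution crosses the threshold.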

Uncovering Customer Issues through Topological Natural Language Analysis

Feb 24, 2024

Shu-Ting Pi, Sidarth Srinivasan, Yuying Zhu, Michael Yang, Qun Liu

Abstract: E-commerce companies deal with a high volume of customer service requests daily. While a simple annotation system is often used to summarize the topics of customer contacts, thoroughly exploring each specific issue can be challenging. This presents a critical concern, especially during an emerging outbreak where companies must quickly identify and address specific issues. To tackle this challenge, we propose a novel machine learning algorithm that leverages natural language techniques and topological data analysis to monitor emerging and trending customer issues. Our approach involves an end-to-end deep learning framework that simultaneously tags the primary question sentence of each customer's transcript and generates sentence embedding vectors. We then whiten the embedding vectors and use them to construct an undirected graph. From there, we define trending and emerging issues based on the topological properties of each transcript. We have validated our results through various methods and found that they are highly consistent with news sources.

* Accepted at the KDD 2023 Workshop on Decision Intelligence and Analytics for Online Marketplaces

Teacher-Student Learning on Complexity in Intelligent Routing

Feb 24, 2024

Shu-Ting Pi, Michael Yang, Yuying Zhu, Qun Liu

Abstract: Customer service is often the most time-consuming aspect for e-commerce websites, with each contact typically taking 10-15 minutes. Effectively routing customers to appropriate agents without transfers is therefore crucial for e-commerce success. To this end, we have developed a machine learning framework that predicts the complexity of customer contacts and routes them to appropriate agents accordingly. The framework consists of two parts. First, we train a teacher model to score the complexity of a contact based on the post-contact transcripts. Then, we use the teacher model as a data annotator to provide labels to train a student model that predicts the complexity based on pre-contact data only. Our experiments show that such a framework is successful and can significantly improve customer experience. We also propose a useful metric called complexity AUC that evaluates the effectiveness of customer service at a statistical level.

* KDD 2023 Workshop on End-End Customer Journey Optimization

Event Camera-based Visual Odometry for Dynamic Motion Tracking of a Legged Robot Using Adaptive Time Surface

May 15, 2023

Shifan Zhu, Zhipeng Tang, Michael Yang, Erik Learned-Miller, Donghyun Kim

Abstract: Our paper proposes a direct sparse visual odometry method that combines event and RGB-D data to estimate the pose of agile legged robots during dynamic locomotion and acrobatic behaviors. Event cameras offer high temporal resolution and dynamic range, which can eliminate the issue of blurred RGB images during fast movements. This unique strength holds potential for accurate pose estimation of agile legged robots, which has been a challenging problem to tackle. Our framework leverages the benefits of both RGB-D and event cameras to achieve robust and accurate pose estimation, even during dynamic maneuvers such as jumping and landing a quadruped robot, the Mini-Cheetah. Our major contributions are threefold: Firstly, we introduce an adaptive time surface (ATS) method that addresses the whiteout and blackout issue in conventional time surfaces by formulating pixel-wise decay rates based on scene complexity and motion speed. Secondly, we develop an effective pixel selection method that directly samples from event data and applies sample filtering through ATS, enabling us to pick pixels on distinct features. Lastly, we propose a nonlinear pose optimization formula that simultaneously performs 3D-2D alignment on both RGB-based and event-based maps and images, allowing the algorithm to fully exploit the benefits of both data streams. We extensively evaluate the performance of our framework on both public datasets and our own quadruped robot dataset, demonstrating its effectiveness in accurately estimating the pose of agile robots during dynamic movements.

* 8 pages, 8 figures
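
To make the idea of pixel-wise decay rates concrete, the sketch below computes a standard exponential time surface, exp(-(t_now - t_last) / tau), but with a separate time constant tau for every pixel. It is not the paper's ATS method: how the authors derive each decay rate from scene complexity and motion speed is not reproduced here, and the function name and data layout are assumptions.

```python
import math


def time_surface(last_event_time, decay, t_now):
    """Exponential time surface with a separate decay constant per pixel.

    last_event_time[r][c] is the timestamp of the latest event at that pixel
    (None if no event has fired yet); decay[r][c] is that pixel's time
    constant. Choosing the decay constants is left to the caller.
    """
    surface = []
    for times_row, decay_row in zip(last_event_time, decay):
        row = []
        for t_last, tau in zip(times_row, decay_row):
            if t_last is None:
                row.append(0.0)  # pixel never fired: fully dark
            else:
                row.append(math.exp(-(t_now - t_last) / tau))
        surface.append(row)
    return surface


# Two pixels fired at t = 1.0 but decay at different speeds; one never fired.
stamps = [[1.0, 1.0, None]]
taus = [[0.5, 5.0, 0.5]]
print(time_surface(stamps, taus, t_now=1.5))
```

With a single global time constant, both active pixels would fade at the same rate. Per-pixel constants let slowly changing regions stay visible while busy regions fade quickly.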

A Masked Bounding-Box Selection Based ResNet Predictor for Text Rotation Prediction

Sep 06, 2022

Michael Yang, Yuan Lin, ChiuMan Ho

Abstract: Existing Optical Character Recognition (OCR) systems are capable of recognizing images with horizontal text. However, as the rotation of the text increases, the text becomes harder to recognize and the performance of OCR systems decreases. Predicting the rotation of the text and correcting the images is therefore important. Previous work mainly uses traditional computer vision methods such as the Hough Transform and deep learning methods such as Convolutional Neural Networks. However, all of these methods are prone to the background noise that commonly exists in general images containing text. To tackle this problem, in this work, we introduce a new masked bounding-box selection method that incorporates the bounding box information into the system. By training a ResNet predictor to focus on the bounding box as the region of interest (ROI), the predictor learns to overlook the background noise. Evaluations on text rotation prediction tasks show that our method improves the performance by a large margin.
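
One simple way to make a predictor focus on bounding boxes is to blank out every pixel that lies outside them before the image reaches the network. The abstract does not state exactly how the masking is applied, so the sketch below, including the function name `mask_outside_boxes`, the box convention, and the fill value, is an assumption for illustration.

```python
def mask_outside_boxes(image, boxes, fill=0):
    """Return a copy of `image` with every pixel outside all boxes set to `fill`.

    `image` is a 2D list of pixel values. Each box is (top, left, bottom, right)
    with an inclusive start and exclusive end; this convention is an
    assumption made for the sketch.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    keep = [[False] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        # Clamp each box to the image so out-of-range boxes are harmless.
        for r in range(max(top, 0), min(bottom, height)):
            for c in range(max(left, 0), min(right, width)):
                keep[r][c] = True
    return [
        [image[r][c] if keep[r][c] else fill for c in range(width)]
        for r in range(height)
    ]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(mask_outside_boxes(image, [(0, 0, 2, 2)]))
```

Only the top-left 2 x 2 region survives; everything else becomes the fill value, so background clutter cannot influence the rotation estimate.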

Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture

Nov 11, 2021

Michael Yang, Aditya Anantharaman, Zachary Kitowski, Derik Clive Robert

Abstract: Previous studies such as VizWiz find that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually-impaired people. TextVQA is a VQA dataset geared towards this problem, where the questions require answering systems to read and reason about visual objects and text objects in images. One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects. This motivates the use of 'edge features', that is, information about the relationship between each pair of objects. Some current TextVQA models address this problem but either only use categories of relations (rather than edge feature vectors) or do not use edge features within the Transformer architectures. In order to overcome these shortcomings, we propose a Graph Relation Transformer (GRT), which uses edge information in addition to node information for graph attention computation in the Transformer. We find that, without using any other optimizations, the proposed GRT method outperforms the accuracy of the M4C baseline model by 0.65% on the val set and 0.57% on the test set. Qualitatively, we observe that the GRT has superior spatial reasoning ability to M4C.

* Presented as a poster at the CVPR 2021 Visual Question Answering Workshop
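
As a rough illustration of using edge information alongside node information in attention, the sketch below adds a scalar bias, derived from each pair's edge feature vector, to the usual scaled dot-product score before the softmax. This additive-bias form is an assumption chosen for simplicity and is not claimed to be the GRT formulation; all names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def edge_aware_attention(queries, keys, edges, edge_weight):
    """Attention weights where each pairwise score gets an edge-derived bias.

    queries[i] and keys[j] are node vectors; edges[i][j] is the feature vector
    for the pair (i, j); edge_weight projects an edge vector to a scalar bias.
    """
    scale = math.sqrt(len(keys[0]))
    weights = []
    for i, q in enumerate(queries):
        scores = [
            dot(q, k) / scale + dot(edges[i][j], edge_weight)
            for j, k in enumerate(keys)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two identical key nodes: only the edge feature tells them apart.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
edges = [[[2.0], [0.0]]]
print(edge_aware_attention(queries, keys, edges, edge_weight=[1.0]))
```

Because the two keys are identical, node features alone would split attention evenly. The edge bias shifts weight toward the first object.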

Image Captioning through Image Transformer

Apr 29, 2020

Sen He, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn, Nicolas Pugeault

Abstract: Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect in captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the detected regions from an object detection model) and sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
