Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chia-Yu Hung

NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Nov 18, 2025

Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, Soujanya Poria

Figure 1 for NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Figure 2 for NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Figure 3 for NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Figure 4 for NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Abstract:Vision--language--action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.

* https://declare-lab.github.io/nora-1.5

Via

Access Paper or Ask Questions

10 Open Challenges Steering the Future of Vision-Language-Action Models

Nov 08, 2025

Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu

Abstract:Due to their ability of follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post training, and data synthesis -- all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models into wider acceptability.

* AAAI 2026 (Senior Track)

Via

Access Paper or Ask Questions

JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Jul 28, 2025

Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria

Figure 1 for JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Figure 2 for JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Figure 3 for JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Figure 4 for JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Abstract:Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high quality and faithful audio outputs capturing to speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as, DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need or manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.

* https://github.com/declare-lab/jamify

Via

Access Paper or Ask Questions

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Apr 28, 2025

Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, Soujanya Poria

Figure 1 for NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Figure 2 for NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Figure 3 for NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Figure 4 for NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Abstract:Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, our \model{} is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.

Via

Access Paper or Ask Questions

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Dec 30, 2024

Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria

Abstract:We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.

* https://tangoflux.github.io/

Via

Access Paper or Ask Questions

Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Jun 25, 2024

Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria

Figure 1 for Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Figure 2 for Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Figure 3 for Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Figure 4 for Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Abstract:The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward -- often due to the conflated formulation of these two aspects - to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidences indicate that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.

Via

Access Paper or Ask Questions

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Apr 16, 2024

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

Figure 1 for Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Figure 2 for Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Figure 3 for Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Figure 4 for Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Abstract:Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.

* https://github.com/declare-lab/tango

Via

Access Paper or Ask Questions

Who Wrote it and Why? Prompting Large-Language Models for Authorship Verification

Oct 12, 2023

Chia-Yu Hung, Zhiqiang Hu, Yujia Hu, Roy Ka-Wei Lee

Figure 1 for Who Wrote it and Why? Prompting Large-Language Models for Authorship Verification

Figure 2 for Who Wrote it and Why? Prompting Large-Language Models for Authorship Verification

Figure 3 for Who Wrote it and Why? Prompting Large-Language Models for Authorship Verification

Figure 4 for Who Wrote it and Why? Prompting Large-Language Models for Authorship Verification

Abstract:Authorship verification (AV) is a fundamental task in natural language processing (NLP) and computational linguistics, with applications in forensic analysis, plagiarism detection, and identification of deceptive content. Existing AV techniques, including traditional stylometric and deep learning approaches, face limitations in terms of data requirements and lack of explainability. To address these limitations, this paper proposes PromptAV, a novel technique that leverages Large-Language Models (LLMs) for AV by providing step-by-step stylometric explanation prompts. PromptAV outperforms state-of-the-art baselines, operates effectively with limited training data, and enhances interpretability through intuitive explanations, showcasing its potential as an effective and interpretable solution for the AV task.

* 7 pages,1 figure

Via

Access Paper or Ask Questions