Zsolt Kira

SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation

Mar 29, 2026

The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models

Mar 28, 2026

Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets

Jan 14, 2026

EVE: A Generator-Verifier System for Generative Policies

Dec 24, 2025

Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Jul 15, 2025

EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction

Jul 10, 2025

FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Jun 18, 2025

MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

Jun 11, 2025

Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

Jun 09, 2025

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

May 27, 2025