University of Toronto
Abstract:Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on narrow chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed, while models produce outputs in incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families spanning both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios stay especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
Abstract:A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
Abstract:Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but it is highly vulnerable to Byzantine attacks. Existing robust approaches can neutralize these threats but incur substantial computational overhead during high-dimensional gradient aggregation, an overhead that scales poorly with model size and increasingly dominates the training cost as modern models grow larger. To address this computational bottleneck, we propose Projected Dimensionality Reduction (PDR), a universal acceleration framework for vector-level distance-based robust aggregators, which performs robust aggregation by compressing gradients into a drastically smaller subspace via sparse random projection to efficiently compute reliability weights. This approach reduces the server computational complexity to an optimal $ \mathcal{O}(Mp) $, where $ M $ is the number of clients and $ p $ is the model dimension, matching the theoretical lower bound required merely to read the gradients. We establish convergence guarantees under standard FL assumptions in prior Byzantine-robust FL analyses. By leveraging the Subspace Embedding Theorem, we show that PDR achieves optimal convergence rates of $ \mathcal{O}(1/\sqrt{T}) $ for non-convex functions and $ \mathcal{O}(1/T) $ for strongly convex functions, where $ T $ denotes the number of iterations. Crucially, we mathematically demonstrate that this massive acceleration comes almost for free, merely inflating the inherent Byzantine error floor by a bounded, tunable factor of $ \frac{1+ε}{1-ε} $. Experimental results on benchmark datasets confirm that integrating PDR with existing aggregators yields orders of magnitude speedups in time efficiency while maintaining highly competitive convergence performance.
Abstract:Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/
Abstract:High-fidelity computational fluid dynamics (CFD) is widely used for thermal-fluid design, but repeated CFD solves remain expensive for design optimization, uncertainty analysis, and digital-twin workflows. Recently, our team has demonstrated that a proper orthogonal decomposition and neural-network (POD-NN) surrogate can predict two-dimensional thermal fields in an electronics-cooling cold plate with large inference speedups while preserving physically interpretable modal structure. Reproducing and extending such workflows, however, typically requires custom scripts for parameter sampling, Fluent automation, data extraction, reduced-order model construction, neural-network training, validation, and prediction. This paper introduces CFDTwin, an open-source Python package and optional desktop graphical user interface (GUI) that packages these steps into a reusable workflow for ANSYS Fluent simulations. CFDTwin allows users to define simulation inputs and output quantities, generate design-of-experiments samples, run and resume Fluent batch simulations, train POD-NN surrogate models for scalar, surface-field, and cell-zone outputs, inspect validation metrics, and evaluate trained models at new design points without re-running Fluent. The same workflow is exposed through a scriptable Python API and a GUI, supporting reproducible studies, user-facing model validation, and automated design exploration. CFDTwin extends the prior POD-NN modeling study from a case-specific research implementation to a reusable research-software platform for CFD surrogate modeling and digital-twin development.
Abstract:Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.
Abstract:Data-driven modeling is becoming central to multiphase transport, electronics cooling, acoustic diagnostics, and thermal-fluid digital twins, but progress is limited by fragmented datasets and raw instrument files that are difficult to decode, reuse, or benchmark. This paper presents an open ecosystem of multimodal datasets and open-source software packages developed by the Nano Energy and Data-Driven Discovery (NED3) Laboratory for reproducible AI-enabled thermal-fluid research. We introduce a spatial-plus-temporal dimensionality framework, denoted S+TD, to classify datasets by the dimensionality of measured or simulated fields, including 0+0D point values, 0+1D time series, 1+0D profiles, 2+0D images, 2+1D videos, 3+0D volumetric fields, and multimodal combinations. We organize public NED3 datasets spanning boiling images, acoustic and thermal measurements, high-speed videos, infrared thermography, thermal-resistance measurements, CFD-generated fields, design files, and acoustic-emission data. We also describe complementary software packages, including BubbleID, SeqReg, CFDTwin, IRISApp, decode-wfs, AELab, and FlowLab, which support computer vision, sequence regression, surrogate modeling, infrared analysis, waveform decoding, acoustic-emission analysis, and multimodal diagnostics. Particular emphasis is placed on SeqReg, a general sequence-regression library for 0+1D, 1+1D, and 2+1D data, with applications such as nonintrusive heat-flux estimation. Finally, we discuss future community efforts to build interoperable thermal-fluid databanks and curated AI/ML tool libraries that connect datasets, metadata, decoders, baselines, benchmarks, and physically interpretable models.
Abstract:Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at https://github.com/VirtualLUOUCAS/Chronicles-OCR.
Abstract:On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
Abstract:We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.