Zihe Wang

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Jun 10, 2025

Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Jihyung Kil, Wei-Lun Chao

Abstract:The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

* First two authors contribute equally

Static Segmentation by Tracking: A Frustratingly Label-Efficient Approach to Fine-Grained Segmentation

Jan 12, 2025

Zhenyang Feng, Zihe Wang, Saul Ibaven Bueno, Tomasz Frelek, Advikaa Ramesh, Jingyan Bai, Lemeng Wang, Zanming Huang, Jianyang Gu, Jinsu Yoo (+15 more)

Abstract:We study image segmentation in the biological domain, particularly trait and part segmentation from specimen images (e.g., butterfly wing stripes or beetle body parts). This is a crucial, fine-grained task that aids in understanding the biology of organisms. The conventional approach involves hand-labeling masks, often for hundreds of images per species, and training a segmentation model to generalize these labels to other images, which can be exceedingly laborious. We present a label-efficient method named Static Segmentation by Tracking (SST). SST is built upon the insight: while specimens of the same species have inherent variations, the traits and parts we aim to segment show up consistently. This motivates us to concatenate specimen images into a "pseudo-video" and reframe trait and part segmentation as a tracking problem. Concretely, SST generates masks for unlabeled images by propagating annotated or predicted masks from the "pseudo-preceding" images. Powered by Segment Anything Model 2 (SAM 2) initially developed for video segmentation, we show that SST can achieve high-quality trait and part segmentation with merely one labeled image per species -- a breakthrough for analyzing specimen images. We further develop a cycle-consistent loss to fine-tune the model, again using one labeled image. Additionally, we highlight the broader potential of SST, including one-shot instance segmentation on images taken in the wild and trait-based image retrieval.

Freshness and Informativity Weighted Cognitive Extent and Its Correlation with Cumulative Citation Count

Dec 04, 2024

Zihe Wang, Jian Wu

Abstract:In this paper, we revisit cognitive extent, originally defined as the number of unique phrases in a quota. We introduce Freshness and Informativity Weighted Cognitive Extent (FICE), calculated based on two novel weighting factors, the lifetime ratio and informativity of scientific entities. We model the lifetime of each scientific entity as the time-dependent document frequency, which is fit by the composition of multiple Gaussian profiles. The lifetime ratio is then calculated as the cumulative document frequency at the publication time $t_0$ divided by the cumulative document frequency over its entire lifetime. The informativity is calculated by normalizing the document frequency across all scientific entities recognized in a title. Using the ACL Anthology, we verified the trend formerly observed in several other domains that the number of unique scientific entities per quota increased gradually at a slower rate. We found that FICE exhibits a strong correlation with the average cumulative citation count within a quota. Our code is available at https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent
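
As a rough illustration of the lifetime ratio described in this abstract, the Python sketch below assumes the fitted document-frequency curve is a sum of Gaussian profiles, each given as a hypothetical `(area, mean, std)` triple. It also treats the "entire lifetime" as the whole real line. Both choices, the function names, and the example numbers are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def gaussian_cdf(t, mean, std):
    """Cumulative distribution function of a normal distribution at time t."""
    return 0.5 * (1.0 + math.erf((t - mean) / (std * math.sqrt(2.0))))


def lifetime_ratio(t0, profiles):
    """Cumulative document frequency up to t0 divided by the lifetime total.

    profiles: list of (area, mean, std) triples (hypothetical fitted values).
    Each triple is one Gaussian profile whose total area is `area`.
    Assumption: the entire lifetime is the integral over all time.
    """
    total = sum(area for area, _, _ in profiles)
    up_to_t0 = sum(area * gaussian_cdf(t0, mean, std)
                   for area, mean, std in profiles)
    return up_to_t0 / total


# Made-up example: an entity with an early peak and a later, smaller revival.
profiles = [(120.0, 2005.0, 3.0), (80.0, 2015.0, 2.0)]
example_ratio = lifetime_ratio(2010.0, profiles)
print(f"lifetime ratio at t0=2010: {example_ratio:.3f}")
```

Under this reading, a ratio close to 0 means the entity is fresh at publication time (most of its usage is still to come), and a ratio close to 1 means it is near the end of its lifetime.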

Are High-Degree Representations Really Unnecessary in Equivariant Graph Neural Networks?

Oct 15, 2024

Jiacheng Cen, Anyi Li, Ning Lin, Yuxiang Ren, Zihe Wang, Wenbing Huang

Abstract:Equivariant Graph Neural Networks (GNNs) that incorporate E(3) symmetry have achieved significant success in various scientific applications. As one of the most successful models, EGNN leverages a simple scalarization technique to perform equivariant message passing over only Cartesian vectors (i.e., 1st-degree steerable vectors), enjoying greater efficiency and efficacy compared to equivariant GNNs using higher-degree steerable vectors. This success suggests that higher-degree representations might be unnecessary. In this paper, we disprove this hypothesis by exploring the expressivity of equivariant GNNs on symmetric structures, including $k$-fold rotations and regular polyhedra. We theoretically demonstrate that equivariant GNNs will always degenerate to a zero function if the degree of the output representations is fixed to 1 or other specific values. Based on this theoretical insight, we propose HEGNN, a high-degree version of EGNN to increase the expressivity by incorporating high-degree steerable vectors while maintaining EGNN's efficiency through the scalarization trick. Our extensive experiments demonstrate that HEGNN not only aligns with our theoretical analyses on toy datasets consisting of symmetric structures, but also shows substantial improvements on more complicated datasets such as $N$-body and MD17. Our theoretical findings and empirical results potentially open up new possibilities for the research of equivariant GNNs.

CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Jul 23, 2024

Jihyung Kil, Zheda Mai, Justin Lee, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Arpita Chowdhury, Wei-Lun Chao

Abstract:The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use CompBench to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe CompBench not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.

A Survey of Geometric Graph Neural Networks: Data Structures, Models and Applications

Mar 01, 2024

Jiaqi Han, Jiacheng Cen, Liming Wu, Zongzhao Li, Xiangzhe Kong, Rui Jiao, Ziyang Yu, Tingyang Xu, Fandi Wu, Zihe Wang (+5 more)

Abstract:A geometric graph is a special kind of graph with geometric features, which is vital for modeling many scientific problems. Unlike generic graphs, geometric graphs often exhibit physical symmetries of translations, rotations, and reflections, so current Graph Neural Networks (GNNs) process them ineffectively. To tackle this issue, researchers have proposed a variety of Geometric Graph Neural Networks equipped with invariant/equivariant properties to better characterize the geometry and topology of geometric graphs. Given the current progress in this field, it is imperative to conduct a comprehensive survey of data structures, models, and applications related to geometric GNNs. In this paper, based on the necessary but concise mathematical preliminaries, we provide a unified view of existing models from the geometric message passing perspective. Additionally, we summarize the applications as well as the related datasets to facilitate later research for methodology development and experimental evaluation. We also discuss the challenges and future potential directions of Geometric GNNs at the end of this survey.

Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation

Sep 12, 2023

Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu

Abstract:One primary topic of multi-modal learning is to jointly incorporate heterogeneous information from different modalities. However, most models suffer from unsatisfactory multi-modal cooperation and cannot jointly utilize all modalities well. Some methods have been proposed to identify and enhance the worse-learnt modality, but they often struggle to provide fine-grained, sample-level observation of multi-modal cooperation with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially in realistic scenarios where the modality discrepancy can vary across samples. To this end, we introduce a fine-grained modality valuation metric to evaluate the contribution of each modality at the sample level. Via modality valuation, we observe that the multi-modal model unfortunately tends to rely on one specific modality, leaving the other modalities low-contributing. We further analyze this issue and improve cooperation between modalities by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution at the sample level and achieve considerable improvement on different multi-modal models.

* 7 pages

Identifying Ransomware Actors in the Bitcoin Network

Aug 28, 2021

Siddhartha Dalal, Zihe Wang, Siddhanth Sabharwal

Abstract:Due to the pseudo-anonymity of the Bitcoin network, users can hide behind their bitcoin addresses, which can be generated in unlimited quantity, on the fly, without any formal links between them. Thus, it is being used for payment transfer by the actors involved in ransomware and other illegal activities. The other activity we consider is related to gambling, since gambling is often used for transferring illegal funds. The question addressed here is: given temporally limited graphs of Bitcoin transactions, to what extent can one identify common patterns associated with these fraudulent activities and apply them to find other ransomware actors? The problem is rather complex, given that thousands of addresses can belong to the same actor without any obvious links between them or any common pattern of behavior. The main contribution of this paper is to introduce and apply new algorithms for local clustering and supervised graph machine learning for identifying malicious actors. We show that very local subgraphs of known actors are sufficient to differentiate between ransomware, random and gambling actors with 85% prediction accuracy on the test data set.

* MLIOB-2 Conference, 2021

LOCCNet: a machine learning framework for distributed quantum information processing

Jan 28, 2021

Xuanqiang Zhao, Benchi Zhao, Zihe Wang, Zhixin Song, Xin Wang

Abstract:Distributed quantum information processing is essential for building quantum networks and enabling more extensive quantum computations. In this regime, several spatially separated parties share a multipartite quantum system, and the most natural set of operations are Local Operations and Classical Communication (LOCC). As a pivotal part in quantum information theory and practice, LOCC has led to many vital protocols such as quantum teleportation. However, designing practical LOCC protocols is challenging due to LOCC's intractable structure and limitations set by near-term quantum devices. Here we introduce LOCCNet, a machine learning framework facilitating protocol design and optimization for distributed quantum information processing tasks. As applications, we explore various quantum information tasks such as entanglement distillation, quantum state discrimination, and quantum channel simulation. We discover novel protocols with evident improvements, in particular, for entanglement distillation with quantum states of interest in quantum information. Our approach opens up new opportunities for exploring entanglement and its applications with machine learning, which will potentially sharpen our understanding of the power and limitations of LOCC.

* 18 pages

Bounded Incentives in Manipulating the Probabilistic Serial Rule

Jan 28, 2020

Zihe Wang, Zhide Wei, Jie Zhang

Abstract:The Probabilistic Serial mechanism is well-known for its desirable fairness and efficiency properties. It is one of the most prominent protocols for the random assignment problem. However, Probabilistic Serial is not incentive-compatible, so these desirable properties only hold for the agents' declared preferences, rather than their genuine preferences. A substantial utility gain through strategic behaviors would trigger self-interested agents to manipulate the mechanism and would subvert the very foundation of adopting the mechanism in practice. In this paper, we characterize the extent to which an individual agent can increase its utility by strategic manipulation. We show that the incentive ratio of the mechanism is $\frac{3}{2}$. That is, no agent can misreport its preferences such that its utility becomes more than 1.5 times what it is when it reports truthfully. This ratio is a worst-case guarantee by allowing an agent to have complete information about other agents' reports and to figure out the best response strategy even if it is computationally intractable in general. To complement this worst-case study, we further evaluate an agent's utility gain on average by experiments. The experiments show that an agent's incentive in manipulating the rule is very limited. These results shed some light on the robustness of Probabilistic Serial against strategic manipulation, which is one step further than knowing that it is not incentive-compatible.

* To appear in AAAI 2020
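
To make the notion of manipulation in this abstract concrete, here is a minimal Python sketch of the Probabilistic Serial rule in its standard "simultaneous eating" formulation, followed by a brute-force search over one agent's misreports. The three-agent instance, the utility values, and all function names are hypothetical and chosen only for illustration. This is not the authors' code or their experimental setup.

```python
from fractions import Fraction
from itertools import permutations


def probabilistic_serial(preferences):
    """Run the simultaneous eating algorithm.

    preferences[i] ranks every item for agent i, most preferred first.
    Each item has unit supply and every agent eats at unit speed.
    Returns alloc, where alloc[i][item] is the share of item given to agent i.
    """
    items = list(preferences[0])
    remaining = {item: Fraction(1) for item in items}
    alloc = [{item: Fraction(0) for item in items} for _ in preferences]
    while any(r > 0 for r in remaining.values()):
        # Each agent eats its best-ranked item that still has supply.
        targets = [next(it for it in prefs if remaining[it] > 0)
                   for prefs in preferences]
        eaters = {}
        for it in targets:
            eaters[it] = eaters.get(it, 0) + 1
        # Advance time until the first targeted item runs out.
        dt = min(remaining[it] / k for it, k in eaters.items())
        for agent, it in enumerate(targets):
            alloc[agent][it] += dt
        for it, k in eaters.items():
            remaining[it] -= dt * k
    return alloc


def utility(shares, values):
    """Expected utility of an allocation under the agent's true values."""
    return sum(values[item] * share for item, share in shares.items())


# Hypothetical instance: agent 0 truly values a=10, b=9, c=0.
true_prefs = [["a", "b", "c"], ["a", "b", "c"], ["b", "c", "a"]]
values0 = {"a": 10, "b": 9, "c": 0}

truthful = utility(probabilistic_serial(true_prefs)[0], values0)
best = max(
    utility(probabilistic_serial([list(p)] + true_prefs[1:])[0], values0)
    for p in permutations(["a", "b", "c"])
)
gain_ratio = Fraction(best) / Fraction(truthful)
print("truthful:", truthful, "best misreport:", best, "ratio:", gain_ratio)
```

In this made-up instance, agent 0 gains by reporting b before a, and the resulting gain ratio of 14/13 stays well below the 3/2 bound stated in the abstract. Exact fractions are used so the comparison does not depend on floating-point rounding.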
