Cheng-Yen Hsieh

Andy

Elucidating the Design Space of Multimodal Protein Language Models

Apr 16, 2025

Cheng-Yen Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu

Abstract: Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding ability of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, which outperforms 3B baselines and is on par with specialized folding models.

* Project Page: https://bytedance.github.io/dplm/dplm-2.1/
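The abstract reports folding quality as RMSD (root-mean-square deviation) between predicted and reference structures. As a rough illustration of what that number measures, here is a minimal sketch that superimposes two coordinate sets with the Kabsch algorithm and then computes RMSD. The function name `kabsch_rmsd` and the toy coordinates are hypothetical, and it is an assumption that the paper's evaluation aligns structures this way; the abstract does not specify the protocol.

```python
import numpy as np

def kabsch_rmsd(pred, ref):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid alignment.

    Illustrative only: the paper's exact evaluation protocol is not stated
    in the abstract.
    """
    # Remove translation by centering both point sets at the origin.
    pred_c = pred - pred.mean(axis=0)
    ref_c = ref - ref.mean(axis=0)

    # Kabsch: SVD of the 3x3 covariance matrix gives the optimal rotation.
    cov = pred_c.T @ ref_c
    u, _, vt = np.linalg.svd(cov)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate the prediction onto the reference and measure the deviation.
    aligned = pred_c @ rot.T
    return float(np.sqrt(np.mean(np.sum((aligned - ref_c) ** 2, axis=1))))

# Toy check: a rotated and shifted copy of the same points has RMSD near zero.
rng = np.random.default_rng(0)
coords = rng.standard_normal((50, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
moved = coords @ rot_z.T + np.array([3.0, -1.0, 2.0])
rmsd_same = kabsch_rmsd(moved, coords)

# Adding noise to the coordinates yields a clearly non-zero RMSD.
noisy = coords + 0.5 * rng.standard_normal(coords.shape)
rmsd_noisy = kabsch_rmsd(noisy, coords)
```

Lower values mean the predicted coordinates sit closer to the reference after superposition, which is why a drop from 5.52 to 2.36 is reported as an improvement.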

Tracking Any Object Amodally

Dec 19, 2023

Cheng-Yen Hsieh, Tarasha Khurana, Achal Dave, Deva Ramanan

Abstract: Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most datasets. To address the scarcity of amodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and occluded objects, including objects that are partially out-of-frame. To enhance amodal tracking with object permanence, we leverage a lightweight plug-in module, the amodal expander, to transform standard, modal trackers into amodal ones through fine-tuning on a few hundred video sequences with data augmentation. We achieve a 3.3% and 1.6% improvement on the detection and tracking of occluded objects on TAO-Amodal. When evaluated on people, our method produces dramatic improvements of 2x compared to state-of-the-art modal baselines.

* Project Page: https://tao-amodal.github.io

Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Aug 30, 2022

Cheng-Yen Hsieh, Chih-Jung Chang, Fu-En Yang, Yu-Chiang Frank Wang

Abstract: While self-supervised learning (SSL) has been shown to benefit a number of vision tasks, existing techniques mainly focus on image-level manipulation, which may not generalize well to downstream tasks at patch or pixel levels. Moreover, existing SSL methods might not sufficiently describe and associate the above representations within and across image scales. In this paper, we propose a Self-Supervised Pyramid Representation Learning (SS-PRL) framework. The proposed SS-PRL is designed to derive pyramid representations at patch levels via learning proper prototypes, with additional learners to observe and relate inherent semantic information within an image. In particular, we present a cross-scale patch-level correlation learning in SS-PRL, which allows the model to aggregate and associate information learned across patch scales. We show that, with our proposed SS-PRL for model pre-training, one can easily adapt and fine-tune the models for a variety of applications including multi-label classification, object detection, and instance segmentation.

* IEEE WACV 2023, Github: https://github.com/WesleyHsieh0806/SS-PRL

C3-SL: Circular Convolution-Based Batch-Wise Compression for Communication-Efficient Split Learning

Jul 25, 2022

Cheng-Yen Hsieh, Yu-Chuan Chuang, An-Yeu Wu

Abstract: Most existing studies improve the efficiency of split learning (SL) by compressing the transmitted features. However, most works focus on dimension-wise compression that transforms high-dimensional features into a low-dimensional space. In this paper, we propose circular convolution-based batch-wise compression for SL (C3-SL) to compress multiple features into one single feature. To avoid information loss while merging multiple features, we exploit the quasi-orthogonality of features in high-dimensional space with circular convolution and superposition. To the best of our knowledge, we are the first to explore the potential of batch-wise compression under the SL scenario. Based on the simulation results on CIFAR-10 and CIFAR-100, our method achieves a 16x compression ratio with negligible accuracy drops compared with the vanilla SL. Moreover, C3-SL reduces memory overhead by 1152x and computation overhead by 2.25x compared to the state-of-the-art dimension-wise compression method.

* 6 pages, IEEE MLSP 2022, Github: https://github.com/WesleyHsieh0806/Split-Learning-Compression
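The batch-wise idea in the abstract (binding each feature with circular convolution, then superposing the results into one vector) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the random Gaussian keys, the function names, and the use of circular correlation for retrieval are all choices made here for the example. Retrieval is approximate, since each recovered feature carries crosstalk noise from the others.

```python
import numpy as np

def circular_convolve(a, b):
    # Circular convolution along the last axis, computed via the FFT.
    n = a.shape[-1]
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=n)

def circular_correlate(key, signal):
    # Circular correlation: the approximate inverse of binding with `key`.
    n = signal.shape[-1]
    return np.fft.irfft(np.conj(np.fft.rfft(key)) * np.fft.rfft(signal), n=n)

def compress(features, keys):
    # Bind each feature to its own key, then superpose (sum) into one vector.
    return circular_convolve(keys, features).sum(axis=0)

def decompress(compressed, keys):
    # Unbind with every key; each row is a noisy estimate of one feature.
    return circular_correlate(keys, compressed[None, :])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
batch, dim = 4, 4096  # hypothetical sizes chosen for the demo

features = rng.standard_normal((batch, dim))
# Random keys with roughly unit norm are nearly orthogonal in high dimensions.
keys = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(batch, dim))

compressed = compress(features, keys)   # shape (dim,): `batch` features in one
recovered = decompress(compressed, keys)  # shape (batch, dim)

# Each recovered row should resemble its own feature far more than the others.
match = [cosine(recovered[i], features[i]) for i in range(batch)]
mismatch = [cosine(recovered[i], features[j])
            for i in range(batch) for j in range(batch) if i != j]
```

In this toy setup, the transmitted vector is `batch` times smaller than the stacked features, and the quasi-orthogonality of the keys is what keeps each retrieved feature distinguishable from the rest.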
