Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lang Huang

Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up

Feb 27, 2025

Lang Huang, Qiyu Wu, Zhongtao Miao, Toshihiko Yamasaki

Abstract:Information retrieval is indispensable for today's Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions for enhancing context interpretation. Through a two-stage training process--comprising post-training adaptation followed by instruction tuning--we adapt MLLMs as retrievers using a simple one-tower architecture. Our approach outperforms conventional methods across diverse retrieval scenarios, particularly when processing complex multi-modal inputs. Notably, the joint fusion encoder yields greater improvements on tasks that require modality fusion compared to those that do not, underscoring the transformative potential of early integration strategies and pointing toward a promising direction for contextually aware and effective information retrieval.

Via

Access Paper or Ask Questions

SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Sep 26, 2024

Zerun Wang, Liuyu Xiang, Lang Huang, Jiafeng Mao, Ling Xiao, Toshihiko Yamasaki

Figure 1 for SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Figure 2 for SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Figure 3 for SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Figure 4 for SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Abstract:Open-set semi-supervised learning (OSSL) leverages practical open-set unlabeled data, comprising both in-distribution (ID) samples from seen classes and out-of-distribution (OOD) samples from unseen classes, for semi-supervised learning (SSL). Prior OSSL methods initially learned the decision boundary between ID and OOD with labeled ID data, subsequently employing self-training to refine this boundary. These methods, however, suffer from the tendency to overtrust the labeled ID data: the scarcity of labeled data caused the distribution bias between the labeled samples and the entire ID data, which misleads the decision boundary to overfit. The subsequent self-training process, based on the overfitted result, fails to rectify this problem. In this paper, we address the overtrusting issue by treating OOD samples as an additional class, forming a new SSL process. Specifically, we propose SCOMatch, a novel OSSL method that 1) selects reliable OOD samples as new labeled data with an OOD memory queue and a corresponding update strategy and 2) integrates the new SSL process into the original task through our Simultaneous Close-set and Open-set self-training. SCOMatch refines the decision boundary of ID and OOD classes across the entire dataset, thereby leading to improved results. Extensive experimental results show that SCOMatch significantly outperforms the state-of-the-art methods on various benchmarks. The effectiveness is further verified through ablation studies and visualization.

* ECCV 2024 accepted

Via

Access Paper or Ask Questions

FIIH: Fully Invertible Image Hiding for Secure and Robust

Jul 24, 2024

Lang Huang, Lin Huo, Zheng Gan, Xinrong He

Figure 1 for FIIH: Fully Invertible Image Hiding for Secure and Robust

Figure 2 for FIIH: Fully Invertible Image Hiding for Secure and Robust

Figure 3 for FIIH: Fully Invertible Image Hiding for Secure and Robust

Figure 4 for FIIH: Fully Invertible Image Hiding for Secure and Robust

Abstract:Image hiding is the study of techniques for covert storage and transmission, which embeds a secret image into a container image and generates stego image to make it similar in appearance to a normal image. However, existing image hiding methods have a serious problem that the hiding and revealing process cannot be fully invertible, which results in the revealing network not being able to recover the secret image losslessly, which makes it impossible to simultaneously achieve high fidelity and secure transmission of the secret image in an insecure network environment. To solve this problem,this paper proposes a fully invertible image hiding architecture based on invertible neural network,aiming to realize invertible hiding of secret images,which is invertible on both data and network. Based on this ingenious architecture, the method can withstand deep learning based image steganalysis. In addition, we propose a new method for enhancing the robustness of stego images after interference during transmission. Experiments demonstrate that the FIIH proposed in this paper significantly outperforms other state-of-the-art image hiding methods in hiding a single image, and also significantly outperforms other state-of-the-art methods in robustness and security.

Via

Access Paper or Ask Questions

Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

Oct 02, 2023

Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi Wakaki, Yuki Mitsufuji

Figure 1 for Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

Figure 2 for Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

Figure 3 for Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

Figure 4 for Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

Abstract:Reporting bias arises when people assume that some knowledge is universally understood and hence, do not necessitate explicit elaboration. In this paper, we focus on the wide existence of reporting bias in visual-language datasets, embodied as the object-attribute association, which can subsequentially degrade models trained on them. To mitigate this bias, we propose a bimodal augmentation (BiAug) approach through object-attribute decoupling to flexibly synthesize visual-language examples with a rich array of object-attribute pairing and construct cross-modal hard negatives. We employ large language models (LLMs) in conjunction with a grounding object detector to extract target objects. Subsequently, the LLM generates a detailed attribute description for each object and produces a corresponding hard negative counterpart. An inpainting model is then used to create images based on these detailed object descriptions. By doing so, the synthesized examples explicitly complement omitted objects and attributes to learn, and the hard negative pairs steer the model to distinguish object attributes. Our experiments demonstrated that BiAug is superior in object-attribute understanding. In addition, BiAug also improves the performance on zero-shot retrieval tasks on general benchmarks like MSCOCO and Flickr30K. BiAug refines the way of collecting text-image datasets. Mitigating the reporting bias helps models achieve a deeper understanding of visual-language phenomena, expanding beyond mere frequent patterns to encompass the richness and diversity of real-world scenarios.

Via

Access Paper or Ask Questions

CoNe: Contrast Your Neighbours for Supervised Image Classification

Aug 21, 2023

Mingkai Zheng, Shan You, Lang Huang, Xiu Su, Fei Wang, Chen Qian, Xiaogang Wang, Chang Xu

Figure 1 for CoNe: Contrast Your Neighbours for Supervised Image Classification

Figure 2 for CoNe: Contrast Your Neighbours for Supervised Image Classification

Figure 3 for CoNe: Contrast Your Neighbours for Supervised Image Classification

Figure 4 for CoNe: Contrast Your Neighbours for Supervised Image Classification

Abstract:Image classification is a longstanding problem in computer vision and machine learning research. Most recent works (e.g. SupCon , Triplet, and max-margin) mainly focus on grouping the intra-class samples aggressively and compactly, with the assumption that all intra-class samples should be pulled tightly towards their class centers. However, such an objective will be very hard to achieve since it ignores the intra-class variance in the dataset. (i.e. different instances from the same class can have significant differences). Thus, such a monotonous objective is not sufficient. To provide a more informative objective, we introduce Contrast Your Neighbours (CoNe) - a simple yet practical learning framework for supervised image classification. Specifically, in CoNe, each sample is not only supervised by its class center but also directly employs the features of its similar neighbors as anchors to generate more adaptive and refined targets. Moreover, to further boost the performance, we propose ``distributional consistency" as a more informative regularization to enable similar instances to have a similar probability distribution. Extensive experimental results demonstrate that CoNe achieves state-of-the-art performance across different benchmark datasets, network architectures, and settings. Notably, even without a complicated training recipe, our CoNe achieves 80.8\% Top-1 accuracy on ImageNet with ResNet-50, which surpasses the recent Timm training recipe (80.4\%). Code and pre-trained models are available at \href{https://github.com/mingkai-zheng/CoNe}{https://github.com/mingkai-zheng/CoNe}.

Via

Access Paper or Ask Questions

SimMatchV2: Semi-Supervised Learning with Graph Consistency

Aug 13, 2023

Mingkai Zheng, Shan You, Lang Huang, Chen Luo, Fei Wang, Chen Qian, Chang Xu

Abstract:Semi-Supervised image classification is one of the most fundamental problem in computer vision, which significantly reduces the need for human labor. In this paper, we introduce a new semi-supervised learning algorithm - SimMatchV2, which formulates various consistency regularizations between labeled and unlabeled data from the graph perspective. In SimMatchV2, we regard the augmented view of a sample as a node, which consists of a label and its corresponding representation. Different nodes are connected with the edges, which are measured by the similarity of the node representations. Inspired by the message passing and node classification in graph theory, we propose four types of consistencies, namely 1) node-node consistency, 2) node-edge consistency, 3) edge-edge consistency, and 4) edge-node consistency. We also uncover that a simple feature normalization can reduce the gaps of the feature norm between different augmented views, significantly improving the performance of SimMatchV2. Our SimMatchV2 has been validated on multiple semi-supervised learning benchmarks. Notably, with ResNet-50 as our backbone and 300 epochs of training, SimMatchV2 achieves 71.9\% and 76.2\% Top-1 Accuracy with 1\% and 10\% labeled examples on ImageNet, which significantly outperforms the previous methods and achieves state-of-the-art performance. Code and pre-trained models are available at \href{https://github.com/mingkai-zheng/SimMatchV2}{https://github.com/mingkai-zheng/SimMatchV2}.

Via

Access Paper or Ask Questions

LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Jul 12, 2022

Tao Huang, Lang Huang, Shan You, Fei Wang, Chen Qian, Chang Xu

Figure 1 for LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Figure 2 for LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Figure 3 for LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Figure 4 for LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Abstract:Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would be actually unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT as a new family of light-weight ViTs to achieve better accuracy-efficiency balance upon the pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both self-attention and feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies; and bi-dimensional channel and spatial attentions are imposed over token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.

* 13 pages, 7 figures, 9 tables

Via

Access Paper or Ask Questions

Green Hierarchical Vision Transformer for Masked Image Modeling

May 26, 2022

Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Toshihiko Yamasaki

Figure 1 for Green Hierarchical Vision Transformer for Masked Image Modeling

Figure 2 for Green Hierarchical Vision Transformer for Masked Image Modeling

Figure 3 for Green Hierarchical Vision Transformer for Masked Image Modeling

Figure 4 for Green Hierarchical Vision Transformer for Masked Image Modeling

Abstract:We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. As a result, MIM now can work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.

* 16 pages, 7 figures, 3 tables, 3 algorithms

Via

Access Paper or Ask Questions

Learning Where to Learn in Cross-View Self-Supervised Learning

Mar 28, 2022

Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Toshihiko Yamasaki

Figure 1 for Learning Where to Learn in Cross-View Self-Supervised Learning

Figure 2 for Learning Where to Learn in Cross-View Self-Supervised Learning

Figure 3 for Learning Where to Learn in Cross-View Self-Supervised Learning

Figure 4 for Learning Where to Learn in Cross-View Self-Supervised Learning

Abstract:Self-supervised learning (SSL) has made enormous progress and largely narrowed the gap with the supervised ones, where the representation learning is mainly guided by a projection into an embedding space. During the projection, current methods simply adopt uniform aggregation of pixels for embedding; however, this risks involving object-irrelevant nuisances and spatial misalignment for different augmentations. In this paper, we present a new approach, Learning Where to Learn (LEWEL), to adaptively aggregate spatial information of features, so that the projected embeddings could be exactly aligned and thus guide the feature learning better. Concretely, we reinterpret the projection head in SSL as a per-pixel projection and predict a set of spatial alignment maps from the original features by this weight-sharing projection head. A spectrum of aligned embeddings is thus obtained by aggregating the features with spatial weighting according to these alignment maps. As a result of this adaptive alignment, we observe substantial improvements on both image-level prediction and dense prediction at the same time: LEWEL improves MoCov2 by 1.6%/1.3%/0.5%/0.4% points, improves BYOL by 1.3%/1.3%/0.7%/0.6% points, on ImageNet linear/semi-supervised classification, Pascal VOC semantic segmentation, and object detection, respectively.

* To appear at CVPR'2022. 13 pages, 5 figures, and 9 tables

Via

Access Paper or Ask Questions

SimMatch: Semi-supervised Learning with Similarity Matching

Mar 17, 2022

Mingkai Zheng, Shan You, Lang Huang, Fei Wang, Chen Qian, Chang Xu

Figure 1 for SimMatch: Semi-supervised Learning with Similarity Matching

Figure 2 for SimMatch: Semi-supervised Learning with Similarity Matching

Figure 3 for SimMatch: Semi-supervised Learning with Similarity Matching

Figure 4 for SimMatch: Semi-supervised Learning with Similarity Matching

Abstract:Learning with few labeled data has been a longstanding problem in the computer vision and machine learning research community. In this paper, we introduced a new semi-supervised learning framework, SimMatch, which simultaneously considers semantic similarity and instance similarity. In SimMatch, the consistency regularization will be applied on both semantic-level and instance-level. The different augmented views of the same instance are encouraged to have the same class prediction and similar similarity relationship respected to other instances. Next, we instantiated a labeled memory buffer to fully leverage the ground truth labels on instance-level and bridge the gaps between the semantic and instance similarities. Finally, we proposed the \textit{unfolding} and \textit{aggregation} operation which allows these two similarities be isomorphically transformed with each other. In this way, the semantic and instance pseudo-labels can be mutually propagated to generate more high-quality and reliable matching targets. Extensive experimental results demonstrate that SimMatch improves the performance of semi-supervised learning tasks across different benchmark datasets and different settings. Notably, with 400 epochs of training, SimMatch achieves 67.2\%, and 74.4\% Top-1 Accuracy with 1\% and 10\% labeled examples on ImageNet, which significantly outperforms the baseline methods and is better than previous semi-supervised learning frameworks. Code and pre-trained models are available at https://github.com/KyleZheng1997/simmatch.

* Accepted to CVPR 2022

Via

Access Paper or Ask Questions