Xuefeng Hu

Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Jan 20, 2026

Matthew Gwilliam, Xiao Wang, Xuefeng Hu, Zhenheng Yang

Abstract: Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at https://github.com/tiktok/huvr.

* 18 pages, 16 tables, 4 figures

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

May 16, 2025

Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

Abstract: Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256, without any additional supervision during training. Code: https://github.com/shallowdream204/DiCo.

* 27 pages, 29 figures, 9 tables

BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

Jun 18, 2024

Xuefeng Hu, Ke Zhang, Min Sun, Albert Chen, Cheng-Hao Kuo, Ram Nevatia

Abstract: Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using Rényi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.

* Preprint updated from our earlier manuscript submitted to ICLR 2024 (https://openreview.net/forum?id=KNtcoAM5Gy)
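
The reliability-weighted aggregation mentioned in the BaFTA abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the order alpha = 0.5, the softmax over negative entropy, the temperature, and the function names renyi_entropy and aggregate are all hypothetical choices made here. It only shows the general idea that predictions with lower Rényi entropy receive larger weights.

```python
import math

def renyi_entropy(probs, alpha=0.5):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)

def aggregate(predictions, alpha=0.5, temperature=1.0):
    """Combine several probability vectors, trusting low-entropy ones more.

    Assumption: weights are a softmax over negative entropies; the paper
    may weight predictions differently.
    """
    logits = [-renyi_entropy(p, alpha) / temperature for p in predictions]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # shift for numerical stability
    total = sum(weights)
    weights = [w / total for w in weights]
    num_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions))
        for c in range(num_classes)
    ]

# A confident view and an uncertain view of the same test image.
combined = aggregate([[0.9, 0.1], [0.5, 0.5]])
```

In this toy call the confident view dominates, so the combined probability for the first class lands above the plain average of 0.7.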

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Dec 07, 2023

Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, Ram Nevatia

Abstract: Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets.

* Technical Report

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Aug 04, 2023

Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun(+2 more)

Abstract: Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any examples, which offers potential benefits for many tasks that have no labeled data. However, when applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modality misalignment can greatly impact the model performance. To address such challenges, we propose ReCLIP, the first source-free domain adaptation method for vision-language models, which does not require any source data or target labeled data. ReCLIP first learns a projection space to mitigate the misaligned visual-text embeddings and learns pseudo labels, and then deploys cross-modality self-training with the pseudo labels, to update visual and text encoders, refine labels and reduce domain gaps and misalignments iteratively. With extensive experiments, we demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.

Efficient Feature Distillation for Zero-shot Detection

Mar 23, 2023

Zhuoming Liu, Xuefeng Hu, Ram Nevatia

Abstract: Large-scale vision-language models (e.g., CLIP) are leveraged by different methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the context of zero-shot detection. In contrast, the distillation-based method is an extra-data-free method, but it has its limitations. Specifically, existing work creates distillation regions that are biased to the base categories, which limits the distillation of novel category information and harms the distillation efficiency. Furthermore, directly using the raw feature from CLIP for distillation neglects the domain gap between the training data of CLIP and the detection datasets, which makes it difficult to learn the mapping from the image region to the vision-language feature space, an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). Firstly, EZSD adapts CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap. Secondly, EZSD uses CLIP to generate distillation proposals with potential novel instances, to avoid the distillation being overly biased to the base categories. Finally, EZSD takes advantage of semantic meaning for regression to further improve the model performance. As a result, EZSD achieves state-of-the-art performance in the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in the LVIS overall setting with 1/10 of the training time.

MixNorm: Test-Time Adaptation Through Online Normalization Estimation

Oct 21, 2021

Xuefeng Hu, Gokhan Uzunbas, Sirius Chen, Rui Wang, Ashish Shah, Ram Nevatia, Ser-Nam Lim

Abstract: We present a simple and effective way to estimate the batch-norm statistics during test time, to quickly adapt a source model to target test samples. Most prior works studying this task, known as Test-Time Adaptation, follow two assumptions in their evaluation: (1) test samples come together as a large batch, and (2) all come from a single test distribution. However, in practice, these two assumptions may not hold, so we propose two new evaluation settings where batch sizes are arbitrary and multiple distributions are considered. Unlike previous methods that require a large batch from a single distribution during test time to calculate stable batch-norm statistics, our method avoids any dependency on large online batches and is able to estimate accurate batch-norm statistics with a single sample. The proposed method significantly outperforms the state of the art in the newly proposed settings of the Test-Time Adaptation task, and also demonstrates improvements in various other settings such as Source-Free Unsupervised Domain Adaptation and Zero-Shot Classification.
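
To make the idea of single-sample batch-norm estimation concrete, here is a minimal sketch. It is not the MixNorm algorithm itself: it assumes, purely for illustration, that the statistics of one test sample's activations in a channel are blended with the stored source statistics through a fixed coefficient. The names mix_norm_stats, normalize, and mix are hypothetical.

```python
import math

def mix_norm_stats(source_mean, source_var, sample_values, mix=0.9):
    """Blend stored source statistics with statistics of a single test sample.

    sample_values holds one channel's activations for one sample.
    mix = 1.0 keeps the source statistics; mix = 0.0 uses only the sample.
    """
    n = len(sample_values)
    sample_mean = sum(sample_values) / n
    sample_var = sum((v - sample_mean) ** 2 for v in sample_values) / n
    mean = mix * source_mean + (1.0 - mix) * sample_mean
    var = mix * source_var + (1.0 - mix) * sample_var
    return mean, var

def normalize(values, mean, var, eps=1e-5):
    """Standard batch-norm style normalization with the given statistics."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# One channel of one test sample, blended half and half with source stats.
mean, var = mix_norm_stats(0.0, 1.0, [2.0, 4.0], mix=0.5)
normalized = normalize([2.0, 4.0], mean, var)
```

Because the blended statistics never depend on other test samples, the sketch works for a batch size of one, which is the setting the abstract emphasizes.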

SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification

Mar 30, 2021

Zijian Hu, Zhengyu Yang, Xuefeng Hu, Ram Nevatia

Abstract: A common situation in classification tasks is one where a large amount of data is available for training, but only a small portion is annotated with class labels. The goal of semi-supervised training, in this context, is to improve classification accuracy by leveraging information not only from labeled data but also from a large amount of unlabeled data. Recent works have developed significant improvements by exploring the consistency constraint between differently augmented labeled and unlabeled data. Following this path, we propose a novel unsupervised objective that focuses on the less studied relationship between the high-confidence unlabeled data that are similar to each other. The newly proposed Pair Loss minimizes the statistical distance between high-confidence pseudo labels with similarity above a certain threshold. Combining the Pair Loss with the techniques developed by the MixMatch family, our proposed SimPLE algorithm shows significant performance gains over previous algorithms on CIFAR-100 and Mini-ImageNet, and is on par with the state-of-the-art methods on CIFAR-10 and SVHN. Furthermore, SimPLE also outperforms the state-of-the-art methods in the transfer learning setting, where models are initialized by the weights pre-trained on ImageNet or DomainNet-Real. The code is available at github.com/zijian-hu/SimPLE.

* Accepted to CVPR 2021. First two authors contributed equally
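
The Pair Loss described in the SimPLE abstract can be sketched roughly as follows. This is a simplified illustration, not the released implementation: it assumes the Bhattacharyya coefficient as the similarity measure and one minus that coefficient as the distance, and the threshold values, the averaging over ordered pairs, and the function names are hypothetical.

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete distributions (1 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def pair_loss(pseudo_labels, predictions, conf_threshold=0.95, sim_threshold=0.9):
    """Average distance over ordered pairs (i, j) that pass both thresholds.

    A pair contributes only if pseudo label i is confident and pseudo
    labels i and j are similar; it then pulls prediction j toward label i.
    """
    n = len(pseudo_labels)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        if max(pseudo_labels[i]) <= conf_threshold:
            continue  # label i is not confident enough
        for j in range(n):
            if i == j:
                continue
            if bhattacharyya(pseudo_labels[i], pseudo_labels[j]) <= sim_threshold:
                continue  # labels i and j are not similar enough
            total += 1.0 - bhattacharyya(pseudo_labels[i], predictions[j])
    return total / (n * (n - 1))

labels = [[1.0, 0.0], [1.0, 0.0]]
loss_agree = pair_loss(labels, [[1.0, 0.0], [1.0, 0.0]])
loss_disagree = pair_loss(labels, [[0.0, 1.0], [0.0, 1.0]])
loss_unconfident = pair_loss([[0.5, 0.5], [0.5, 0.5]], [[0.0, 1.0], [0.0, 1.0]])
```

Pairs whose pseudo labels are not confident, or not similar to each other, contribute nothing, which mirrors the two thresholds mentioned in the abstract.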

SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization

Sep 01, 2020

Xuefeng Hu, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri, Zhenheng Yang, Ram Nevatia

Abstract: We present a novel framework, Spatial Pyramid Attention Network (SPAN), for detection and localization of multiple types of image manipulations. The proposed architecture efficiently and effectively models the relationship between image patches at multiple scales by constructing a pyramid of local self-attention blocks. The design includes a novel position projection to encode the spatial positions of the patches. SPAN is trained on a generic, synthetic dataset but can also be fine-tuned for specific datasets. The proposed method shows significant gains in performance on standard datasets over previous state-of-the-art methods.

* ECCV 2020

Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval

Mar 04, 2019

Svebor Karaman, Xudong Lin, Xuefeng Hu, Shih-Fu Chang

Abstract: We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation. Such compact hash codes enable the complete elimination of real-valued feature storage and allow for significant reduction of the computation complexity and storage cost of large-scale image retrieval applications. Specifically, we learn a neural network-based model, which transforms the input representation into a binary representation. We formalize the training objective of the network in an intuitive and effective way, considering each training sample as a query and aiming to obtain the same retrieval results using the produced hash codes as those obtained with the original features. This training formulation directly optimizes the hashing model for the target usage of the hash codes it produces. We further explore the addition of a decoder trained to obtain an approximated reconstruction of the original features. At test time, we retrieve the most promising database samples with an efficient graph-based search procedure using only our hash codes and perform re-ranking using the reconstructed features, thus without needing to access the original features at all. Experiments conducted on multiple publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost.
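
As a rough picture of retrieval with binary codes, the sketch below binarizes real-valued vectors and ranks a database by Hamming distance. It is only a stand-in under stated assumptions: a plain sign threshold replaces the learned hashing network, an exhaustive scan replaces the graph-based search, and the decoder-based re-ranking is omitted. The function names to_code, hamming, and retrieve are hypothetical.

```python
def to_code(vector):
    """Pack the sign pattern of a real-valued vector into an integer bit code."""
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

def retrieve(query_code, db_codes, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    order = sorted(
        range(len(db_codes)),
        key=lambda i: hamming(query_code, db_codes[i]),
    )
    return order[:k]

query = to_code([0.5, -1.0, 2.0])
top_two = retrieve(query, [0b110, 0b101, 0b100], k=2)
```

Only the integer codes are compared at query time, which is what allows the real-valued features to be dropped from storage.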
