Abstract:Accurately measuring the geometry and spatially-varying reflectance of real-world objects is a complex task due to their intricate shapes formed by concave features, hollow engravings and diverse surfaces, resulting in inter-reflection and occlusion when photographed. Moreover, issues like lens flare and overexposure can arise from interference from secondary reflections and limitations of hardware even in professional studios. In this paper, we propose a novel approach using polarized reflectance field capture and a comprehensive statistical analysis algorithm to obtain highly accurate surface normals (within 0.1mm/px) and spatially-varying reflectance data, including albedo, specular separation, roughness, and anisotropy parameters for realistic rendering and analysis. Our algorithm removes image artifacts via analytical modeling and further employs both an initial step and an optimization step computed on the whole image collection to further enhance the precision of per-pixel surface reflectance and normal measurement. We showcase the captured shapes and reflectance of diverse objects with a wide material range, spanning from highly diffuse to highly glossy - a challenge unaddressed by prior techniques. Our approach enhances downstream applications by offering precise measurements for realistic rendering and provides a valuable training dataset for emerging research in inverse rendering. We will release the polarized reflectance fields of several captured objects with this work.
Abstract:With the rise of large-scale language models (LLMs), it is currently popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that the current methods of multi-modal multi-hop question answering still mainly face two challenges: 1) The retrieved evidence containing a large amount of redundant information, inevitably leads to a significant drop in performance due to irrelevant information misleading the prediction. 2) The reasoning process without interpretable reasoning steps makes the model difficult to discover the logical errors for handling complex questions. To solve these problems, we propose a unified LLMs-based approach but without heavily relying on them due to the LLM's potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework with a focus on facilitating common knowledge sharing across interpretability and prediction tasks while preventing task-specific errors from interfering with each other via mixture of experts. Afterward, we design an iterative feedback mechanism to further enhance both tasks by feeding back the results of the joint training to the LLM for regenerating entailment trees, aiming to iteratively refine the potential answer. Notably, our method has won the first place in the official leaderboard of WebQA (since April 10, 2024), and achieves competitive results on MultimodalQA.
Abstract:Pathological cell semantic segmentation is a fundamental technology in computational pathology, essential for applications like cancer diagnosis and effective treatment. Given that multiple cell types exist across various organs, with subtle differences in cell size and shape, multi-organ, multi-class cell segmentation is particularly challenging. Most existing methods employ multi-branch frameworks to enhance feature extraction, but often result in complex architectures. Moreover, reliance on visual information limits performance in multi-class analysis due to intricate textural details. To address these challenges, we propose a Multi-OrgaN multi-Class cell semantic segmentation method with a single brancH (MONCH) that leverages vision-language input. Specifically, we design a hierarchical feature extraction mechanism to provide coarse-to-fine-grained features for segmenting cells of various shapes, including high-frequency, convolutional, and topological features. Inspired by the synergy of textual and multi-grained visual features, we introduce a progressive prompt decoder to harmonize multimodal information, integrating features from fine to coarse granularity for better context capture. Extensive experiments on the PanNuke dataset, which has significant class imbalance and subtle cell size and shape variations, demonstrate that MONCH outperforms state-of-the-art cell segmentation methods and vision-language models. Codes and implementations will be made publicly available.
Abstract:Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.
Abstract:Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.
Abstract:Fine-grained sentiment analysis involves extracting and organizing sentiment elements from textual data. However, existing approaches often overlook issues of category semantic inclusion and overlap, as well as inherent structural patterns within the target sequence. This study introduces a generative sentiment analysis model. To address the challenges related to category semantic inclusion and overlap, a latent category distribution variable is introduced. By reconstructing the input of a variational autoencoder, the model learns the intensity of the relationship between categories and text, thereby improving sequence generation. Additionally, a trie data structure and constrained decoding strategy are utilized to exploit structural patterns, which in turn reduces the search space and regularizes the generation process. Experimental results on the Restaurant-ACOS and Laptop-ACOS datasets demonstrate a significant performance improvement compared to baseline models. Ablation experiments further confirm the effectiveness of latent category distribution and constrained decoding strategy.
Abstract:Current hair transfer methods struggle to handle diverse and intricate hairstyles, thus limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named \textit{Stable-Hair}, which robustly transfers a wide range of real-world hairstyles onto user-provided faces for virtual hair try-on. To achieve this goal, our Stable-Hair framework is designed as a two-stage pipeline. In the first stage, we train a Bald Converter alongside stable diffusion to remove hair from the user-provided face images, resulting in bald images. In the second stage, we specifically designed three modules: a Hair Extractor, a Latent IdentityNet, and Hair Cross-Attention Layers to transfer the target hairstyle with highly detailed and high-fidelity to the bald image. Specifically, the Hair Extractor is trained to encode reference images with the desired hairstyles. To preserve the consistency of identity content and background between the source images and the transfer results, we employ a Latent IdentityNet to encode the source images. With the assistance of our Hair Cross-Attention Layers in the U-Net, we can accurately and precisely transfer the highly detailed and high-fidelity hairstyle to the bald image. Extensive experiments have demonstrated that our approach delivers state-of-the-art (SOTA) results among existing hair transfer methods. Project page: \textcolor{red}{\url{https://xiaojiu-z.github.io/Stable-Hair.github.io/}}
Abstract:Open-source EDA tools are rapidly advancing, fostering collaboration, innovation, and knowledge sharing within the EDA community. However, the growing complexity of these tools, characterized by numerous design parameters and heuristics, poses a significant barrier to their widespread adoption. This complexity is particularly pronounced in integrated circuit (IC) backend designs, which place substantial demands on engineers' expertise in EDA tools. To tackle this challenge, we introduce IICPilot, an intelligent IC backend design system based on LLM technology. IICPilot automates various backend design procedures, including script generation, EDA tool invocation, design space exploration of EDA parameters, container-based computing resource allocation, and exception management. By automating these tasks, IICPilot significantly lowers the barrier to entry for open-source EDA tools. Specifically, IICPilot utilizes LangChain's multi-agent framework to efficiently handle distinct design tasks, enabling flexible enhancements independently. Moreover, IICPilot separates the backend design workflow from specific open-source EDA tools through a unified EDA calling interface. This approach allows seamless integration with different open-source EDA tools like OpenROAD and iEDA, streamlining the backend design and optimization across the EDA tools.
Abstract:With the rapid development of depth sensor, more and more RGB-D videos could be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, the existing salient object detection (SOD) works only focus on either static RGB-D images or RGB videos, ignoring the collaborating of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos within a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated to a high-quality saliency annotation. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry information from the depth map by devising three modality-specific branches and a multi-modality integration branch. The modality-specific branches extract the representation of different inputs, while the multi-modality integration branch combines the multi-level modality-specific features by introducing the encoder feature aggregation (MEA) modules and decoder feature aggregation (MDA) modules. The experimental findings conducted on both our newly introduced ViDSOD-100 dataset and the well-established DAVSOD dataset highlight the superior performance of the proposed ATF-Net. This performance enhancement is demonstrated both quantitatively and qualitatively, surpassing the capabilities of current state-of-the-art techniques across various domains, including RGB-D saliency detection, video saliency detection, and video object segmentation. Our data and our code are available at github.com/jhl-Det/RGBD_Video_SOD.
Abstract:Video recognition remains an open challenge, requiring the identification of diverse content categories within videos. Mainstream approaches often perform flat classification, overlooking the intrinsic hierarchical structure relating categories. To address this, we formalize the novel task of hierarchical video recognition, and propose a video-language learning framework tailored for hierarchical recognition. Specifically, our framework encodes dependencies between hierarchical category levels, and applies a top-down constraint to filter recognition predictions. We further construct a new fine-grained dataset based on medical assessments for rehabilitation of stroke patients, serving as a challenging benchmark for hierarchical recognition. Through extensive experiments, we demonstrate the efficacy of our approach for hierarchical recognition, significantly outperforming conventional methods, especially for fine-grained subcategories. The proposed framework paves the way for hierarchical modeling in video understanding tasks, moving beyond flat categorization.