Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhongwei Xie

SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

May 17, 2025

Songchun Zhang, Huiyao Xu, Sitong Guo, Zhongwei Xie, Pengwei Liu, Hujun Bao, Weiwei Xu, Changqing Zou

Abstract:Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, though progressed, rely on dense multi-view observations, restricting their application. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes.

* 18 pages, 16 figures

Via

Access Paper or Ask Questions

Top Ten Challenges Towards Agentic Neural Graph Databases

Jan 24, 2025

Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang(+9 more)

Abstract:Graph databases (GDBs) like Neo4j and TigerGraph excel at handling interconnected data but lack advanced inference capabilities. Neural Graph Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for predictive analysis and reasoning over incomplete or noisy data. However, NGDBs rely on predefined queries and lack autonomy and adaptability. This paper introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs with three core functionalities: autonomous query construction, neural query execution, and continuous learning. We identify ten key challenges in realizing Agentic NGDBs: semantic unit representation, abductive reasoning, scalable query execution, and integration with foundation models like large language models (LLMs). By addressing these challenges, Agentic NGDBs can enable intelligent, self-improving systems for modern data-driven applications, paving the way for adaptable and autonomous data management solutions.

* 12 Pages

Via

Access Paper or Ask Questions

Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning

Jan 06, 2025

Zhen Li, Yupeng Su, Runming Yang, Zhongwei Xie, Ngai Wong, Hongxia Yang

Abstract:Large language models have achieved significant advancements in complex mathematical reasoning benchmarks, such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy to reduce memory usage and computational costs by employing lower precision and bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. We introduce a multidimensional evaluation framework that qualitatively assesses specific capability dimensions and conduct quantitative analyses on the step-by-step outputs of various quantization methods. Our results demonstrate that quantization differentially affects numerical computation and reasoning planning abilities, identifying key areas where quantized models experience performance degradation.

* 4 pages

Via

Access Paper or Ask Questions

Morpho-Aware Global Attention for Image Matting

Nov 15, 2024

Jingru Yang, Chengzhi Cao, Chentianye Xu, Zhongwei Xie, Kaixiang Huang, Yang Zhou, Shengfeng He

Figure 1 for Morpho-Aware Global Attention for Image Matting

Figure 2 for Morpho-Aware Global Attention for Image Matting

Figure 3 for Morpho-Aware Global Attention for Image Matting

Figure 4 for Morpho-Aware Global Attention for Image Matting

Abstract:Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) face inherent challenges in image matting, particularly in preserving fine structural details. ViTs, with their global receptive field enabled by the self-attention mechanism, often lose local details such as hair strands. Conversely, CNNs, constrained by their local receptive field, rely on deeper layers to approximate global context but struggle to retain fine structures at greater depths. To overcome these limitations, we propose a novel Morpho-Aware Global Attention (MAGA) mechanism, designed to effectively capture the morphology of fine structures. MAGA employs Tetris-like convolutional patterns to align the local shapes of fine structures, ensuring optimal local correspondence while maintaining sensitivity to morphological details. The extracted local morphology information is used as query embeddings, which are projected onto global key embeddings to emphasize local details in a broader context. Subsequently, by projecting onto value embeddings, MAGA seamlessly integrates these emphasized morphological details into a unified global structure. This approach enables MAGA to simultaneously focus on local morphology and unify these details into a coherent whole, effectively preserving fine structures. Extensive experiments show that our MAGA-based ViT achieves significant performance gains, outperforming state-of-the-art methods across two benchmarks with average improvements of 4.3% in SAD and 39.5% in MSE.

Via

Access Paper or Ask Questions

Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

Oct 22, 2021

Zhongwei Xie, Ling Liu, Yanzhao Wu, Luo Zhong, Lin Li

Figure 1 for Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

Figure 2 for Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

Figure 3 for Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

Figure 4 for Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

Abstract:This paper introduces a two-phase deep feature engineering framework for efficient learning of semantics enhanced joint embedding, which clearly separates the deep feature engineering in data preprocessing from training the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature engineering by combining deep feature engineering with semantic context features derived from raw text-image input data. We leverage LSTM to identify key terms, deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for key terms before generating the vector representation for each key term by using word2vec. We leverage wideResNet50 and word2vec to extract and encode the image category semantics of food images to help semantic alignment of the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with soft-margin and double negative sampling, taking into account also the category-based alignment loss and discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with deep feature engineering significantly outperforms the state-of-the-art approaches.

* accepted by ACM Transactions on Information Systems(TOIS). arXiv admin note: text overlap with arXiv:2108.00705, arXiv:2108.03788

Via

Access Paper or Ask Questions

Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Aug 18, 2021

Zhongwei Xie, Ling Liu, Lin Li, Luo Zhong

Figure 1 for Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Figure 2 for Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Figure 3 for Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Figure 4 for Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Abstract:This paper presents a three-tier modality alignment approach to learning text-image joint embedding, coined as JEMA, for cross-modal retrieval of cooking recipes and food images. The first tier improves recipe text embedding by optimizing the LSTM networks with term extraction and ranking enhanced sequence patterns, and optimizes the image embedding by combining the ResNeXt-101 image encoder with the category embedding using wideResNet-50 with word2vec. The second tier modality alignment optimizes the textual-visual joint embedding loss function using a double batch-hard triplet loss with soft-margin optimization. The third modality alignment incorporates two types of cross-modality alignments as the auxiliary loss regularizations to further reduce the alignment errors in the joint learning of the two modality-specific embedding functions. The category-based cross-modal alignment aims to align the image category with the recipe category as a loss regularization to the joint embedding. The cross-modal discriminator-based alignment aims to add the visual-textual embedding distribution alignment to further regularize the joint embedding loss. Extensive experiments with the one-million recipes benchmark dataset Recipe1M demonstrate that the proposed JEMA approach outperforms the state-of-the-art cross-modal embedding methods for both image-to-recipe and recipe-to-image retrievals.

* accepted by CIKM 2021. arXiv admin note: substantial text overlap with arXiv:2108.00705

Via

Access Paper or Ask Questions

Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

Aug 08, 2021

Zhongwei Xie, Ling Liu, Lin Li, Luo Zhong

Figure 1 for Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

Figure 2 for Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

Figure 3 for Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

Figure 4 for Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

Abstract:This paper introduces a two-phase deep feature calibration framework for efficient learning of semantics enhanced text-image cross-modal joint embedding, which clearly separates the deep feature calibration in data preprocessing from training the joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature calibration by combining deep feature engineering with semantic context features derived from raw text-image input data. We leverage LSTM to identify key terms, NLP methods to produce ranking scores for key terms before generating the key term feature. We leverage wideResNet50 to extract and encode the image category semantics to help semantic alignment of the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature calibration by optimizing the batch-hard triplet loss function with soft-margin and double negative sampling, also utilizing the category-based alignment loss and discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with the deep feature calibration significantly outperforms the state-of-the-art approaches.

* accepted by ACM ICMI 2021 conference

Via

Access Paper or Ask Questions

Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Aug 02, 2021

Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, Luo Zhong

Figure 1 for Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Figure 2 for Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Figure 3 for Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Figure 4 for Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Abstract:It is widely acknowledged that learning joint embeddings of recipes with images is challenging due to the diverse composition and deformation of ingredients in cooking procedures. We present a Multi-modal Semantics enhanced Joint Embedding approach (MSJE) for learning a common feature space between the two modalities (text and image), with the ultimate goal of providing high-performance cross-modal retrieval services. Our MSJE approach has three unique features. First, we extract the TFIDF feature from the title, ingredients and cooking instructions of recipes. By determining the significance of word sequences through combining LSTM learned features with their TFIDF features, we encode a recipe into a TFIDF weighted vector for capturing significant key terms and how such key terms are used in the corresponding cooking instructions. Second, we combine the recipe TFIDF feature with the recipe sequence feature extracted through two-stage LSTM networks, which is effective in capturing the unique relationship between a recipe and its associated image(s). Third, we further incorporate TFIDF enhanced category semantics to improve the mapping of image modality and to regulate the similarity loss function during the iterative learning of cross-modal joint embedding. Experiments on the benchmark dataset Recipe1M show the proposed approach outperforms the state-of-the-art approaches.

* accepted by IEEE Transactions on Services Computing

Via

Access Paper or Ask Questions

Promoting High Diversity Ensemble Learning with EnsembleBench

Oct 20, 2020

Yanzhao Wu, Ling Liu, Zhongwei Xie, Juhyun Bae, Ka-Ho Chow, Wenqi Wei

Figure 1 for Promoting High Diversity Ensemble Learning with EnsembleBench

Figure 2 for Promoting High Diversity Ensemble Learning with EnsembleBench

Figure 3 for Promoting High Diversity Ensemble Learning with EnsembleBench

Figure 4 for Promoting High Diversity Ensemble Learning with EnsembleBench

Abstract:Ensemble learning is gaining renewed interests in recent years. This paper presents EnsembleBench, a holistic framework for evaluating and recommending high diversity and high accuracy ensembles. The design of EnsembleBench offers three novel features: (1) EnsembleBench introduces a set of quantitative metrics for assessing the quality of ensembles and for comparing alternative ensembles constructed for the same learning tasks. (2) EnsembleBench implements a suite of baseline diversity metrics and optimized diversity metrics for identifying and selecting ensembles with high diversity and high quality, making it an effective framework for benchmarking, evaluating and recommending high diversity model ensembles. (3) Four representative ensemble consensus methods are provided in the first release of EnsembleBench, enabling empirical study on the impact of consensus methods on ensemble accuracy. A comprehensive experimental evaluation on popular benchmark datasets demonstrates the utility and effectiveness of EnsembleBench for promoting high diversity ensembles and boosting the overall performance of selected ensembles.

Via

Access Paper or Ask Questions

Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings

Oct 22, 2018

Zhongwei Xie, Lin Li, Xian Zhong, Luo Zhong

Figure 1 for Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings

Figure 2 for Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings

Abstract:Image-to-video person re-identification identifies a target person by a probe image from quantities of pedestrian videos captured by non-overlapping cameras. Despite the great progress achieved,it's still challenging to match in the multimodal scenario,i.e. between image and video. Currently,state-of-the-art approaches mainly focus on the task-specific data,neglecting the extra information on the different but related tasks. In this paper,we propose an end-to-end neural network framework for image-to-video person reidentification by leveraging cross-modal embeddings learned from extra information.Concretely speaking,cross-modal embeddings from image captioning and video captioning models are reused to help learned features be projected into a coordinated space,where similarity can be directly computed. Besides,training steps from fixed model reuse approach are integrated into our framework,which can incorporate beneficial information and eventually make the target networks independent of existing models. Apart from that,our proposed framework resorts to CNNs and LSTMs for extracting visual and spatiotemporal features,and combines the strengths of identification and verification model to improve the discriminative ability of the learned feature. The experimental results demonstrate the effectiveness of our framework on narrowing down the gap between heterogeneous data and obtaining observable improvement in image-to-video person re-identification.

* under review for Pattern Recognition Letters

Via

Access Paper or Ask Questions