Abstract: In this study, we delve into the Structured State Space Model (S4), change point detection methodologies, and Switching Non-Linear Dynamical Systems (SNLDS). Our central proposition is an enhanced inference technique for SNLDS together with a mechanism for capturing long-range dependencies. The cornerstone of our approach is the fusion of S4 and SNLDS, leveraging the strengths of both models to effectively address long-range dependencies in switching time series. Through rigorous testing, we demonstrate that our proposed methodology adeptly segments and reproduces long-range dependencies on both the 1-D Lorenz dataset and the 2-D bouncing-ball dataset. Notably, our integrated approach outperforms the standalone SNLDS on these tasks.
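A minimal, illustrative sketch of the general idea described above: an S4-style diagonal state-space layer extracts long-range temporal features, and a switching head assigns each timestep to one of several discrete dynamical modes. This is not the paper's implementation; the module names, the simplified recurrent scan, and all hyperparameters are assumptions made for illustration.

```python
# Hedged sketch: S4-like long-range features feeding a discrete switching head.
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Simplified diagonal linear state-space layer (a stand-in for S4)."""
    def __init__(self, d_input, d_state=64):
        super().__init__()
        # Learnable diagonal decay, kept in (0, 1) for a stable recurrence.
        self.log_decay = nn.Parameter(torch.rand(d_state))
        self.B = nn.Parameter(torch.randn(d_state, d_input) * 0.1)
        self.C = nn.Parameter(torch.randn(d_input, d_state) * 0.1)

    def forward(self, u):                                # u: (batch, time, d_input)
        decay = torch.exp(-torch.exp(self.log_decay))    # values in (0, 1)
        x = torch.zeros(u.size(0), decay.size(0), device=u.device)
        outputs = []
        for t in range(u.size(1)):                       # recurrent scan over time
            x = decay * x + u[:, t] @ self.B.T
            outputs.append(x @ self.C.T)
        return torch.stack(outputs, dim=1)               # (batch, time, d_input)

class SwitchingHead(nn.Module):
    """Assigns each timestep to one of K discrete modes (SNLDS-style switch)."""
    def __init__(self, d_input, num_modes=3):
        super().__init__()
        self.ssm = DiagonalSSM(d_input)
        self.mode_logits = nn.Linear(d_input, num_modes)

    def forward(self, u):
        features = self.ssm(u)                 # long-range temporal features
        return self.mode_logits(features)      # per-timestep mode logits

model = SwitchingHead(d_input=1, num_modes=3)
series = torch.randn(8, 200, 1)                # e.g. a batch of 1-D Lorenz segments
segmentation = model(series).argmax(-1)        # (8, 200) discrete mode labels
```

In the full SNLDS setting, the discrete modes would additionally govern per-mode nonlinear dynamics and be inferred with a structured posterior rather than a per-timestep classifier.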
Abstract: Embedding-based retrieval methods construct vector indices to search for the document representations most similar to the query representations. They are widely used in document retrieval due to their low latency and decent recall performance. Recent research indicates that deep retrieval solutions offer better model quality but are hindered by unacceptable serving latency and the inability to support document updates. In this paper, we aim to enhance the vector index with end-to-end deep generative models, leveraging the differentiable advantages of deep retrieval models while maintaining desirable serving efficiency. We propose the Model-enhanced Vector Index (MEVI), a differentiable model-enhanced index empowered by a twin-tower representation model. MEVI leverages a Residual Quantization (RQ) codebook to bridge sequence-to-sequence deep retrieval and embedding-based models. To substantially reduce inference time, instead of decoding unique document ids over long sequences of steps, we first generate the semantic virtual cluster ids of candidate documents in a small number of steps, and then leverage the well-adapted embedding vectors to perform a fine-grained search for the relevant documents within the candidate virtual clusters. We empirically show that our model achieves better performance on the commonly used academic benchmarks MS MARCO Passage and Natural Questions, with serving latency comparable to dense retrieval solutions.
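The two-stage lookup can be pictured with a small sketch: a generative model first proposes a handful of virtual cluster ids, and an embedding search then ranks only the documents assigned to those clusters. The function and variable names below are illustrative assumptions, not MEVI's actual API.

```python
# Hedged sketch of coarse cluster generation followed by fine-grained embedding search.
import numpy as np

def two_stage_search(query_emb, generated_cluster_ids,
                     doc_embs, doc_cluster_ids, top_k=10):
    """Stage 1 (assumed done elsewhere): a seq2seq model produced candidate
    virtual cluster ids. Stage 2: rank documents inside those clusters by
    inner-product similarity to the query embedding."""
    candidate_mask = np.isin(doc_cluster_ids, generated_cluster_ids)
    candidate_idx = np.nonzero(candidate_mask)[0]
    scores = doc_embs[candidate_idx] @ query_emb          # inner-product scores
    order = np.argsort(-scores)[:top_k]
    return candidate_idx[order], scores[order]

# Toy usage: 1,000 documents assigned to 32 RQ-style clusters (synthetic data).
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 128)).astype(np.float32)
doc_cluster_ids = rng.integers(0, 32, size=1000)
query_emb = rng.normal(size=128).astype(np.float32)
generated_clusters = np.array([3, 17, 21])                # assumed stage-1 output
doc_ids, doc_scores = two_stage_search(query_emb, generated_clusters,
                                        doc_embs, doc_cluster_ids)
```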
Abstract: We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models along the dimensions of semantics and multilinguality. Our analyses span both open-source and closed models, yielding empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate that (1) most models exhibit varied behavior when given paraphrased instructions; (2) many models still suffer from exposure bias (e.g., positional bias, majority label bias); (3) for questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across semantically equivalent multilingual queries, yet most models surprisingly demonstrate inconsistent performance on these queries; and (4) multilingually-trained models have not attained "balanced multilingual" capabilities. Our findings underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations of multilingual and multicultural scenarios.
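One of the brittleness checks described above, cross-query consistency, can be illustrated with a short sketch that measures how often a model gives the same answer across semantically equivalent variants of a question (paraphrases or translations). The data structures here are assumed for clarity and are not SeaEval's actual interface.

```python
# Hedged sketch of a cross-variant consistency score.
from collections import Counter

def consistency_score(groups):
    """groups: list of lists, each inner list holding the model's answers to
    semantically equivalent variants of one question."""
    consistent = 0
    for answers in groups:
        most_common_count = Counter(answers).most_common(1)[0][1]
        if most_common_count == len(answers):   # identical answer on every variant
            consistent += 1
    return consistent / len(groups)

# Example: three questions, each asked in English, Chinese, and a paraphrase.
model_answers = [
    ["Paris", "Paris", "Paris"],     # consistent
    ["4", "4", "5"],                 # inconsistent
    ["blue", "blue", "blue"],        # consistent
]
print(consistency_score(model_answers))  # 0.666...
```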
Abstract: Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly study incorporating cross-modal interactions into the independent-embedding framework or leveraging stronger pretrained encoders, which still demand time-consuming similarity measurement or heavyweight model structures at the retrieval stage. In this work, we propose SelfAlign, an image-text alignment module on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level through self-supervised contrastive learning. It requires no cross-modal embedding interactions during training and maintains independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models by 9.1%, 4.2%, and 6.6% in R@sum on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively. Its retrieval accuracy also outperforms most existing interactive-embedding models with orders-of-magnitude decreases in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
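A minimal sketch of the kind of self-supervised contrastive alignment described above: independent image and text encoders produce embeddings that are pulled together with a symmetric InfoNCE-style loss, so no cross-modal interaction is needed at retrieval time. The encoder outputs and the temperature value below are assumptions; SelfAlign's concept-level and context-level sub-modules are not reproduced here.

```python
# Hedged sketch of symmetric image-text contrastive alignment.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs sit on the diagonal; align in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

image_emb = torch.randn(32, 512)    # outputs of an independent image encoder (assumed)
text_emb = torch.randn(32, 512)     # outputs of an independent text encoder (assumed)
loss = symmetric_contrastive_loss(image_emb, text_emb)
```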
Abstract: Knowledge-based visual question answering requires the ability to associate external knowledge for open-ended cross-modal scene understanding. One limitation of existing solutions is that they capture relevant knowledge from text-only knowledge bases, which merely contain facts expressed by first-order predicates or language descriptions while lacking the complex but indispensable multimodal knowledge needed for visual understanding. How to construct vision-relevant and explainable multimodal knowledge for the VQA scenario has been less studied. In this paper, we propose MuKEA to represent multimodal knowledge with explicit triplets that correlate visual objects and fact answers through implicit relations. To bridge the heterogeneous gap, we propose three objective losses to learn the triplet representations from complementary views: embedding structure, topological relation, and semantic space. By adopting a pre-training and fine-tuning learning strategy, both basic and domain-specific multimodal knowledge are progressively accumulated for answer prediction. We outperform the state of the art by 3.35% and 6.08%, respectively, on two challenging knowledge-required datasets: OK-VQA and KRVQA. Experimental results show that the multimodal knowledge complements existing knowledge bases and demonstrate the advantages of our end-to-end framework over existing pipeline methods. The code is available at https://github.com/AndersonStra/MuKEA.
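As a hedged sketch of the "embedding structure" view implied by the triplet formulation above (head = visual object, tail = fact answer, relation = implicit link), the snippet below shows a TransE-style margin loss; the topological-relation and semantic-space losses are omitted, and all dimensions and negative-sampling choices are assumptions.

```python
# Hedged sketch of a TransE-style structure loss over multimodal triplets.
import torch
import torch.nn.functional as F

def triplet_structure_loss(head, relation, tail, neg_tail, margin=1.0):
    """Encourage head + relation ≈ tail while pushing away a sampled negative tail."""
    pos_dist = torch.norm(head + relation - tail, p=2, dim=-1)
    neg_dist = torch.norm(head + relation - neg_tail, p=2, dim=-1)
    return F.relu(margin + pos_dist - neg_dist).mean()

head = torch.randn(16, 300)       # visual-object embeddings (assumed dimension)
relation = torch.randn(16, 300)   # implicit relation embeddings
tail = torch.randn(16, 300)       # ground-truth answer embeddings
neg_tail = torch.randn(16, 300)   # sampled negative answers
loss = triplet_structure_loss(head, relation, tail, neg_tail)
```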
Abstract: Neural Machine Translation (NMT) has achieved significant breakthroughs in performance but is known to be vulnerable to input perturbations. Since real input noise is difficult to predict during training, robustness is a major concern for system deployment. In this paper, we improve the robustness of NMT models by reducing the effect of noisy words through a Context-Enhanced Reconstruction (CER) approach. CER trains the model to resist noise in two steps: (1) a perturbation step that breaks the naturalness of the input sequence with made-up words; (2) a reconstruction step that defends against noise propagation by generating better and more robust contextual representations. Experimental results on Chinese-English (ZH-EN) and French-English (FR-EN) translation tasks demonstrate robustness improvements on both news and social media text. Further fine-tuning experiments on social media text show that our approach converges at a higher performance level and provides better adaptation.
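An illustrative sketch of the perturbation step described above: a fraction of source tokens is replaced with a made-up placeholder token before encoding, and the reconstruction step would then encourage the contextual representations of the corrupted input to match those of the clean input. The token ids, noise rate, and placeholder id below are assumptions, not the paper's actual configuration.

```python
# Hedged sketch of corrupting a source sequence with made-up tokens.
import torch

def perturb_with_madeup_words(src_tokens, madeup_token_id, noise_rate=0.15,
                              pad_token_id=0):
    """Replace a fraction of non-padding tokens with a made-up token id."""
    mask = torch.rand_like(src_tokens, dtype=torch.float) < noise_rate
    mask &= src_tokens.ne(pad_token_id)          # never corrupt padding positions
    return torch.where(mask, torch.full_like(src_tokens, madeup_token_id),
                       src_tokens)

src = torch.randint(4, 32000, (2, 10))           # a toy batch of token ids
noisy_src = perturb_with_madeup_words(src, madeup_token_id=32001)
# The reconstruction step would then train the encoder so that contextual
# representations of noisy_src stay close to those of the clean src.
```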