Abstract:Large language models (LLMs), endowed with exceptional reasoning capabilities, are adept at discerning profound user interests from historical behaviors, thereby presenting a promising avenue for the advancement of recommendation systems. However, a notable discrepancy persists between the sparse collaborative semantics typically found in recommendation systems and the dense token representations within LLMs. In our study, we propose a novel framework that harmoniously merges traditional recommendation models with the prowess of LLMs. We initiate this integration by transforming ItemIDs into sequences that align semantically with the LLMs space, through the proposed Alignment Tokenization module. Additionally, we design a series of specialized supervised learning tasks aimed at aligning collaborative signals with the subtleties of natural language semantics. To ensure practical applicability, we optimize online inference by pre-caching the top-K results for each user, reducing latency and improving effciency. Extensive experimental evidence indicates that our model markedly improves recall metrics and displays remarkable scalability of recommendation systems.
Abstract:The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates the model grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate the more fine-grained IQA application. Code: https://github.com/zhengchen1999/Grounding-IQA.
Abstract:Binary feature descriptors have been widely used in various visual measurement tasks, particularly those with limited computing resources and storage capacities. Existing binary descriptors may not perform well for long-term visual measurement tasks due to their sensitivity to illumination variations. It can be observed that when image illumination changes dramatically, the relative relationship among local patches mostly remains intact. Based on the observation, consequently, this study presents an illumination-insensitive binary (IIB) descriptor by leveraging the local inter-patch invariance exhibited in multiple spatial granularities to deal with unfavorable illumination variations. By taking advantage of integral images for local patch feature computation, a highly efficient IIB descriptor is achieved. It can encode scalable features in multiple spatial granularities, thus facilitating a computationally efficient hierarchical matching from coarse to fine. Moreover, the IIB descriptor can also apply to other types of image data, such as depth maps and semantic segmentation results, when available in some applications. Numerical experiments on both natural and synthetic datasets reveal that the proposed IIB descriptor outperforms state-of-the-art binary descriptors and some testing float descriptors. The proposed IIB descriptor has also been successfully employed in a demo system for long-term visual localization. The code of the IIB descriptor will be publicly available.
Abstract:The bound of the information transmission rate of direct current biased optical orthogonal frequency division multiplexing (DCO-OFDM) for visible light communication (VLC) with finite-alphabet inputs is yet unknown, where the corresponding spectral efficiency (SE) and energy efficiency (EE) stems out as the open research problems. In this paper, we derive the exact achievable rate of {the} DCO-OFDM system with finite-alphabet inputs for the first time. Furthermore, we investigate SE maximization problems of {the} DCO-OFDM system subject to both electrical and optical power constraints. By exploiting the relationship between the mutual information and the minimum mean-squared error, we propose a multi-level mercury-water-filling power allocation scheme to achieve the maximum SE. Moreover, the EE maximization problems of {the} DCO-OFDM system are studied, and the Dinkelbach-type power allocation scheme is developed for the maximum EE. Numerical results verify the effectiveness of the proposed theories and power allocation schemes.
Abstract:In this paper, we study the spectral efficiency (SE) and energy efficiency (EE) of asymmetrically clipped optical orthogonal frequency division multiplexing (ACO-OFDM) for visible light communication (VLC). Firstly, we derive the achiev-able rates for Gaussian distributions inputs and practical finite-alphabet inputs. Then, we investigate the SE maximization problems subject to both the total transmit power constraint and the average optical power constraint with the above two inputs, respectively. By exploiting the relationship between the mutual information and the minimum mean-squared error, an optimal power allocation scheme is proposed to maximize the SE with finite-alphabet inputs. To reduce the computational complexity of the power allocation scheme, we derive a closed-form lower bound of the SE. Also, considering the quality of service, we further tackle the non-convex EE maximization problems of ACO-OFDM with the two inputs, respectively. The problems are solved by the proposed Dinkelbach-type iterative algorithm. In each iteration, the interior point algorithm is applied to obtain the optimal power allocation.The performance of the proposed power allocation schemes for the SE and EE maximization are validated through numerical analysis.
Abstract:Many algorithms for visible light positioning (VLP) localization do not consider the shapes of the transmitters, which leads to the impracticality of the algorithm and the low localization accuracy. Therefore, this paper proposes a novel VLP algorithm and addresses the problems in terms of practicality and complexity by using one non-point transmitter based on a monocular. Because the shape of the transmitter is considered, the proposed algorithm is easy for practice and has wide applicability. Besides, it decreases the computation for simple geometric model and expands the coverage for there is a greater chance of receiving signals from one light than that of receiving signals from multiple lights.
Abstract:A wide range of optimization problems can be recast as monotone inclusion problems. We propose a unifying framework for solving the monotone inclusion problem with randomized Forward-Backward algorithms. Our framework covers many existing deterministic and stochastic algorithms. Under various conditions, we can establish both sublinear and linear convergence rates in expectation for the algorithms covered by this framework. In addition, we consider algorithm design as well as asynchronous randomized Forward algorithms. Numerical experiments demonstrate the worth of the new algorithms that emerge from our framework
Abstract:We address the problem of using hand-drawn sketches to edit the facial identity, such as enlarging the shape or modifying the position of eyes or mouth, in the entire video. This task is formulated as a 3D face model reconstruction and deformation problem. We first introduce a two-stage real-time 3D face model fitting schema to recover the facial identity and expressions from the video. User's editing intention is recognized from input sketches as a set of facial modifications. Then a novel identity deformation algorithm is proposed to transfer these facial deformations from 2D space to the 3D facial identity directly, while preserving the facial expressions. After an optional stage for further refining the 3D face model, these changes are propagated to the whole video with the modified identity. Both the user study and experimental results demonstrate that our sketching framework can help users effectively edit facial identities in videos, while high consistency and fidelity are ensured at the same time.