Abstract:Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at \href{https://github.com/spherepaircc/SpherePairCC/tree/main}{our repository}.
Abstract:Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
Abstract:In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward -- the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI Agents.
Abstract:Feature matching is a fundamental and crucial process in visual SLAM, and precision has always been a challenging issue in feature matching. In this paper, based on a multi-level fine matching strategy, we propose a new feature matching method called KTGP-ORB. This method utilizes the similarity of local appearance in the Hamming space generated by feature descriptors to establish initial correspondences. It combines the constraint of local image motion smoothness, uses the GMS algorithm to enhance the accuracy of initial matches, and finally employs the PROSAC algorithm to optimize matches, achieving precise matching based on global grayscale information in Euclidean space. Experimental results demonstrate that the KTGP-ORB method reduces the error by an average of 29.92% compared to the ORB algorithm in complex scenes with illumination variations and blur.
Abstract:In Visual SLAM, achieving accurate feature matching consumes a significant amount of time, severely impacting the real-time performance of the system. This paper proposes an accelerated method for Visual SLAM by integrating GMS (Grid-based Motion Statistics) with RANSAC (Random Sample Consensus) for the removal of mismatched features. The approach first utilizes the GMS algorithm to estimate the quantity of matched pairs within the neighborhood and ranks the matches based on their confidence. Subsequently, the Random Sample Consensus (RANSAC) algorithm is employed to further eliminate mismatched features. To address the time-consuming issue of randomly selecting all matched pairs, this method transforms it into the problem of prioritizing sample selection from high-confidence matches. This enables the iterative solution of the optimal model. Experimental results demonstrate that the proposed method achieves a comparable accuracy to the original GMS-RANSAC while reducing the average runtime by 24.13% on the KITTI, TUM desk, and TUM doll datasets.
Abstract:To address the issue of increased triangulation uncertainty caused by selecting views with small camera baselines in Structure from Motion (SFM) view selection, this paper proposes a robust error-resistant view selection method. The method utilizes a triangulation-based computation to obtain an error-resistant model, which is then used to construct an error-resistant matrix. The sorting results of each row in the error-resistant matrix determine the candidate view set for each view. By traversing the candidate view sets of all views and completing the missing views based on the error-resistant matrix, the integrity of 3D reconstruction is ensured. Experimental comparisons between this method and the exhaustive method with the highest accuracy in the COLMAP program are conducted in terms of average reprojection error and absolute trajectory error in the reconstruction results. The proposed method demonstrates an average reduction of 29.40% in reprojection error accuracy and 5.07% in absolute trajectory error on the TUM dataset and DTU dataset.
Abstract:The images captured by Wireless Capsule Endoscopy (WCE) always exhibit specular reflections, and removing highlights while preserving the color and texture in the region remains a challenge. To address this issue, this paper proposes a highlight removal method for capsule endoscopy images. Firstly, the confidence and feature terms of the highlight region's edges are computed, where confidence is obtained by the ratio of known pixels in the RGB space's R channel to the B channel within a window centered on the highlight region's edge pixel, and feature terms are acquired by multiplying the gradient vector of the highlight region's edge pixel with the iso-intensity line. Subsequently, the confidence and feature terms are assigned different weights and summed to obtain the priority of all highlight region's edge pixels, and the pixel with the highest priority is identified. Then, the variance of the highlight region's edge pixels is used to adjust the size of the sample block window, and the best-matching block is searched in the known region based on the RGB color similarity and distance between the sample block and the window centered on the pixel with the highest priority. Finally, the pixels in the best-matching block are copied to the highest priority highlight removal region to achieve the goal of removing the highlight region. Experimental results demonstrate that the proposed method effectively removes highlights from WCE images, with a lower coefficient of variation in the highlight removal region compared to the Crinimisi algorithm and DeepGin method. Additionally, the color and texture in the highlight removal region are similar to those in the surrounding areas, and the texture is continuous.
Abstract:This paper presents, for the first time, an image enhancement methodology designed to enhance the clarity of small intestinal villi in Wireless Capsule Endoscopy (WCE) images. This method first separates the low-frequency and high-frequency components of small intestinal villi images using guided filtering. Subsequently, an adaptive light gain factor is generated based on the low-frequency component, and an adaptive gradient gain factor is derived from the convolution results of the Laplacian operator in different regions of small intestinal villi images. The obtained light gain factor and gradient gain factor are then combined to enhance the high-frequency components. Finally, the enhanced high-frequency component is fused with the original image to achieve adaptive sharpening of the edges of WCE small intestinal villi images. The experiments affirm that, compared to established WCE image enhancement methods, our approach not only accentuates the edge details of WCE small intestine villi images but also skillfully suppresses noise amplification, thereby preventing the occurrence of edge overshooting.
Abstract:To address the issue of feature descriptors being ineffective in representing grayscale feature information when images undergo high affine transformations, leading to a rapid decline in feature matching accuracy, this paper proposes a region feature descriptor based on simulating affine transformations using classification. The proposed method initially categorizes images with different affine degrees to simulate affine transformations and generate a new set of images. Subsequently, it calculates neighborhood information for feature points on this new image set. Finally, the descriptor is generated by combining the grayscale histogram of the maximum stable extremal region to which the feature point belongs and the normalized position relative to the grayscale centroid of the feature point's region. Experimental results, comparing feature matching metrics under affine transformation scenarios, demonstrate that the proposed descriptor exhibits higher precision and robustness compared to existing classical descriptors. Additionally, it shows robustness when integrated with other descriptors.
Abstract:Skeleton-based action recognition is a central task of human-computer interaction. However, most of the previous methods suffer from two issues: (i) semantic ambiguity arising from spatiotemporal information mixture; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to local optimum solutions of the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences, which can be incorporated into almost all previous skeleton encoders and have no impact on the skeleton encoders when testing. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatiotemporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we employ the attentive features to contrastive learning, which models the cross-sequence semantic relations by pulling together the features from the positive pairs and pushing away the negative pairs. Extensive experiments show that STD-CL with four various skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvement on NTU60, NTU120, and NW-UCLA benchmarks. The code will be released.