Abstract:3D Human Mesh Reconstruction (HMR) from 2D RGB images faces challenges in environments with poor lighting, privacy concerns, or occlusions. These weaknesses of RGB imaging can be complemented by acoustic signals, which are widely available, easy to deploy, and capable of penetrating obstacles. However, no existing methods effectively combine acoustic signals with RGB data for robust 3D HMR. The primary challenges include the low-resolution images generated by acoustic signals and the lack of dedicated processing backbones. We introduce SonicMesh, a novel approach combining acoustic signals with RGB images to reconstruct 3D human mesh. To address the challenges of low resolution and the absence of dedicated processing backbones in images generated by acoustic signals, we modify an existing method, HRNet, for effective feature extraction. We also integrate a universal feature embedding technique to enhance the precision of cross-dimensional feature alignment, enabling SonicMesh to achieve high accuracy. Experimental results demonstrate that SonicMesh accurately reconstructs 3D human mesh in challenging environments such as occlusions, non-line-of-sight scenarios, and poor lighting.
Abstract:Dense prediction is a critical task in computer vision. However, previous methods often require extensive computational resources, which hinders their real-world application. In this paper, we propose BiDense, a generalized binary neural network (BNN) designed for efficient and accurate dense prediction tasks. BiDense incorporates two key techniques: the Distribution-adaptive Binarizer (DAB) and the Channel-adaptive Full-precision Bypass (CFB). The DAB adaptively calculates thresholds and scaling factors for binarization, effectively retaining more information within BNNs. Meanwhile, the CFB facilitates full-precision bypassing for binary convolutional layers undergoing various channel size transformations, which enhances the propagation of real-valued signals and minimizes information loss. By leveraging these techniques, BiDense preserves more real-valued information, enabling more accurate and detailed dense predictions in BNNs. Extensive experiments demonstrate that our framework achieves performance levels comparable to full-precision models while significantly reducing memory usage and computational costs.
Abstract:In this paper, we propose a novel and efficient parameter estimator based on $k$-Nearest Neighbor ($k$NN) and data generation method for the Lognormal-Rician turbulence channel. The Kolmogorov-Smirnov (KS) goodness-of-fit statistical tools are employed to investigate the validity of $k$NN approximation under different channel conditions and it is shown that the choice of $k$ plays a significant role in the approximation accuracy. We present several numerical results to illustrate that solving the constructed objective function can provide a reasonable estimate for the actual values. The accuracy of the proposed estimator is investigated in terms of the mean square error. The simulation results show that increasing the number of generation samples by two orders of magnitude does not lead to a significant improvement in estimation performance when solving the optimization problem by the gradient descent algorithm. However, the estimation performance under the genetic algorithm (GA) approximates to that of the saddlepoint approximation and expectation-maximization estimators. Therefore, combined with the GA, we demonstrate that the proposed estimator achieves the best tradeoff between the computation complexity and the accuracy.
Abstract:Two-view pose estimation is essential for map-free visual relocalization and object pose tracking tasks. However, traditional matching methods suffer from time-consuming robust estimators, while deep learning-based pose regressors only cater to camera-to-world pose estimation, lacking generalizability to different image sizes and camera intrinsics. In this paper, we propose SRPose, a sparse keypoint-based framework for two-view relative pose estimation in camera-to-world and object-to-camera scenarios. SRPose consists of a sparse keypoint detector, an intrinsic-calibration position encoder, and promptable prior knowledge-guided attention layers. Given two RGB images of a fixed scene or a moving object, SRPose estimates the relative camera or 6D object pose transformation. Extensive experiments demonstrate that SRPose achieves competitive or superior performance compared to state-of-the-art methods in terms of accuracy and speed, showing generalizability to both scenarios. It is robust to different image sizes and camera intrinsics, and can be deployed with low computing resources.
Abstract:Single domain generalization aims to address the challenge of out-of-distribution generalization problem with only one source domain available. Feature distanglement is a classic solution to this purpose, where the extracted task-related feature is presumed to be resilient to domain shift. However, the absence of references from other domains in a single-domain scenario poses significant uncertainty in feature disentanglement (ill-posedness). In this paper, we propose a new framework, named \textit{Domain Game}, to perform better feature distangling for medical image segmentation, based on the observation that diagnostic relevant features are more sensitive to geometric transformations, whilist domain-specific features probably will remain invariant to such operations. In domain game, a set of randomly transformed images derived from a singular source image is strategically encoded into two separate feature sets to represent diagnostic features and domain-specific features, respectively, and we apply forces to pull or repel them in the feature space, accordingly. Results from cross-site test domain evaluation showcase approximately an ~11.8% performance boost in prostate segmentation and around ~10.5% in brain tumor segmentation compared to the second-best method.
Abstract:Mobile edge computing (MEC) deployment in a multi-robot cooperation (MRC) system is an effective way to accomplish the tasks in terms of energy consumption and implementation latency. However, the computation and communication resources need to be considered jointly to fully exploit the advantages brought by the MEC technology. In this paper, the scenario where multi robots cooperate to accomplish the time-critical tasks is studied, where an intelligent master robot (MR) acts as an edge server to provide services to multiple slave robots (SRs) and the SRs are responsible for the environment sensing and data collection. To save energy and prolong the function time of the system, two schemes are proposed to optimize the computation and communication resources, respectively. In the first scheme, the energy consumption of SRs is minimized and balanced while guaranteeing that the tasks are accomplished under a time constraint. In the second scheme, not only the energy consumption, but also the remaining energies of the SRs are considered to enhance the robustness of the system. Through the analysis and numerical simulations, we demonstrate that even though the first policy may guarantee the minimization on the total SRs' energy consumption, the function time of MRC system by the second scheme is longer than that by the first one.
Abstract:Latent factor models play a dominant role among recommendation techniques. However, most of the existing latent factor models assume embedding dimensions are independent of each other, and thus regrettably ignore the interaction information across different embedding dimensions. In this paper, we propose a novel latent factor model called COMET (COnvolutional diMEnsion inTeraction), which provides the first attempt to model higher-order interaction signals among all latent dimensions in an explicit manner. To be specific, COMET stacks the embeddings of historical interactions horizontally, which results in two "embedding maps" that encode the original dimension information. In this way, users' and items' internal interactions can be exploited by convolutional neural networks with kernels of different sizes and a fully-connected multi-layer perceptron. Furthermore, the representations of users and items are enriched by the learnt interaction vectors, which can further be used to produce the final prediction. Extensive experiments and ablation studies on various public implicit feedback datasets clearly demonstrate the effectiveness and the rationality of our proposed method.
Abstract:Graph-based recommendation models work well for top-N recommender systems due to their capability to capture the potential relationships between entities. However, most of the existing methods only construct a single global item graph shared by all the users and regrettably ignore the diverse tastes between different user groups. Inspired by the success of local models for recommendation, this paper provides the first attempt to investigate multiple local item graphs along with a global item graph for graph-based recommendation models. We argue that recommendation on global and local graphs outperforms that on a single global graph or multiple local graphs. Specifically, we propose a novel graph-based recommendation model named GLIMG (Global and Local IteM Graphs), which simultaneously captures both the global and local user tastes. By integrating the global and local graphs into an adapted semi-supervised learning model, users' preferences on items are propagated globally and locally. Extensive experimental results on real-world datasets show that our proposed method consistently outperforms the state-of-the art counterparts on the top-N recommendation task.