Abstract:Reconstructing 3D human-object interaction (HOI) from single-view RGB images is challenging due to the absence of depth information and potential occlusions. Existing methods predict body poses by relying solely on networks trained on indoor datasets, which cannot guarantee plausible results when body parts are invisible due to occlusions, which arise easily. Inspired by the end-effector localization task in robotics, we propose a kinematics-based method that accurately drives the joints of the human body to the human-object contact regions. Building on an improved forward kinematics algorithm, a Multi-Layer Perceptron is introduced into the inverse kinematics solution to determine joint poses, achieving more precise results than the numerical methods commonly used in robotics. In addition, a Contact Region Recognition Network (CRRNet) is proposed to robustly determine the contact regions from a single-view video. Experimental results demonstrate that our method outperforms the state of the art on the BEHAVE benchmark. Additionally, our approach shows good portability and can be seamlessly integrated into other methods for optimization.
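The kinematic-chain idea can be illustrated with a minimal sketch (not the paper's actual algorithm; a planar chain with assumed joint angles and link lengths stands in for the articulated body): forward kinematics accumulates joint rotations along the chain and returns the end-effector position, the quantity that must be driven to a contact region.

```python
import math

def forward_kinematics(joint_angles, link_lengths):
    """Planar forward kinematics: accumulate each joint's rotation
    along the chain and return the end-effector (x, y) position,
    a stand-in for the contact point of a body part."""
    x = y = 0.0
    theta = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle                  # rotations compose along the chain
        x += length * math.cos(theta)   # advance along the rotated link
        y += length * math.sin(theta)
    return x, y
```

In the paper's setting, an MLP replaces the iterative numerical solver for the inverse problem: given a target contact position, it predicts the joint angles that this forward map must reproduce.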
Abstract:Federated learning (FL) has emerged as a new paradigm for privacy-preserving collaborative training. Under domain skew, current FL approaches are biased and face two fairness problems. 1) Parameter Update Conflict: data disparity among clients leads to varying parameter importance and inconsistent update directions. These two disparities cause important parameters to be overwhelmed by unimportant ones in dominant updates, resulting in significant performance decreases for lower-performing clients. 2) Model Aggregation Bias: existing FL approaches introduce unfair weight allocation and neglect domain diversity, leading to a biased convergence objective and uneven performance across domains. We discover a pronounced directional update consistency in federated learning and propose a novel framework to tackle the above issues. First, leveraging the discovered characteristic, we selectively discard unimportant parameter updates so that updates from lower-performing clients are not overwhelmed by the unimportant parameters of dominant clients, resulting in fairer generalization performance. Second, we propose a fair aggregation objective that prevents the global model from biasing towards some domains, ensuring that the global model continuously aligns with an unbiased model. The proposed method is generic and can be combined with other existing FL methods to enhance fairness. Comprehensive experiments on Digits and Office-Caltech demonstrate the high fairness and performance of our method.
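The update-filtering step can be sketched as follows. This is a simplified stand-in: the paper's importance criterion builds on the discovered directional update consistency, whereas here plain magnitude top-k selection is assumed as the importance measure.

```python
import numpy as np

def keep_topk_updates(update, keep_ratio=0.2):
    """Zero out all but the largest-magnitude entries of a client's
    parameter update, so that unimportant components of dominant
    updates cannot overwhelm important ones from weaker clients."""
    flat = np.abs(update).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    mask = np.abs(update) >= threshold
    return update * mask
```

Applied per client before aggregation, each surviving coordinate is one a client actually considers important, rather than incidental noise from a dominant domain.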
Abstract:Ultra-massive multiple-input multiple-output (UM-MIMO) is an enabler of Terahertz (THz) communications in next-generation wireless networks. In THz UM-MIMO systems, a new paradigm of cross-field communications spanning from near-field to far-field is emerging, since the near-field range expands with higher frequencies and larger array apertures. Precise beam alignment in the cross-field is critical but challenging. Specifically, unlike far-field beams that rely only on the angle domain, the incorporation of dual-domain (angle and distance) training significantly increases overhead. A natural question is whether far-field beam training can be deployed for cross-field beam alignment. In this paper, this question is answered by demonstrating that far-field training attains a sufficient signal-to-noise ratio (SNR) in both far- and near-field scenarios while exciting all channel dimensions. Based on this, we propose a subarray-coordinated hierarchical (SCH) training scheme with greatly reduced overhead. To further obtain high-precision beam designs, we propose a two-phase angle and distance beam estimator (TPBE). Extensive simulations demonstrate the effectiveness of the proposed methods. Compared to near-field exhaustive search, the SCH requires only 0.2\% of the training overhead. The TPBE achieves root-mean-squared estimation errors of 0.01~degrees for angle and 0.02~m for distance. Furthermore, with the estimated beam directions, a near-optimal SNR within 0.11~dB of the optimum is attained after beam alignment.
Abstract:Automated log analysis is crucial in modern software-intensive systems for ensuring reliability and resilience throughout software maintenance and engineering life cycles. Existing methods perform tasks such as log parsing and log anomaly detection by providing a single prediction value without interpretation. However, given the increasing volume of system events, the limited interpretability of analysis results hinders analysts' trust and their ability to take appropriate actions. Moreover, these methods require substantial in-domain training data, and their performance declines sharply (by up to 62.5%) in online scenarios involving unseen logs from new domains, a common occurrence due to rapid software updates. In this paper, we propose LogPrompt, a novel zero-shot and interpretable log analysis approach. LogPrompt employs large language models (LLMs) to perform zero-shot log analysis tasks via a suite of advanced prompt strategies tailored for log tasks, which enhance the LLMs' performance by up to 107.5% compared with simple prompts. Experiments on nine publicly available evaluation datasets across two tasks demonstrate that LogPrompt, despite using no training data, outperforms existing approaches trained on thousands of logs by up to around 50%. We also conducted a human evaluation of LogPrompt's interpretability with six practitioners possessing over 10 years of experience, who rated the generated content highly for usefulness and readability (4.42/5 on average). LogPrompt also exhibits remarkable compatibility with open-source and smaller-scale LLMs, making it flexible for practical deployment.
Abstract:Anomaly detection in multivariate time series data is of paramount importance for ensuring the efficient operation of large-scale systems across diverse domains. However, accurately detecting anomalies in such data poses significant challenges, which existing approaches, including forecasting- and reconstruction-based methods, struggle to address effectively. To overcome these limitations, we propose a novel anomaly detection framework named ImDiffusion, which combines time series imputation and diffusion models to achieve accurate and robust anomaly detection. The imputation-based approach employed by ImDiffusion leverages information from neighboring values in the time series, enabling precise modeling of temporal and inter-correlated dependencies and reducing uncertainty in the data, thereby enhancing the robustness of the anomaly detection process. ImDiffusion further leverages diffusion models as time series imputers to accurately capture complex dependencies. The step-by-step denoised outputs generated during the inference process serve as valuable signals for anomaly prediction, resulting in improved accuracy and robustness of detection. We evaluate the performance of ImDiffusion via extensive experiments on benchmark datasets. The results demonstrate that our proposed framework significantly outperforms state-of-the-art approaches in terms of detection accuracy and timeliness. ImDiffusion has further been integrated into a real production system at Microsoft, where it achieves a remarkable 11.4% increase in detection F1 score compared to the legacy approach. To the best of our knowledge, ImDiffusion represents a pioneering approach that combines imputation-based techniques with time series anomaly detection, while introducing the novel use of diffusion models to the field.
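A toy version of imputation-based scoring illustrates the principle, with a neighborhood mean as an assumed stand-in for the diffusion imputer: mask each point, re-estimate it from its neighbors, and flag the points the imputer cannot explain.

```python
import numpy as np

def imputation_anomaly_scores(series):
    """Score each point by its imputation error: hide the point,
    re-estimate it from up to two neighbors on each side (a crude
    stand-in for a learned imputer), and take the absolute residual."""
    scores = np.zeros(len(series), dtype=float)
    for t in range(len(series)):
        neighbors = np.concatenate(
            [series[max(0, t - 2):t], series[t + 1:t + 3]])
        scores[t] = abs(series[t] - neighbors.mean())
    return scores
```

A point that fits its temporal context gets a small residual; an anomaly that neighbors cannot predict gets a large one, which is the signal ImDiffusion refines with diffusion-based imputation.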
Abstract:The loss function for bounding box regression (BBR) is essential to object detection, and a well-defined loss brings significant performance improvements. Most existing works assume that the examples in the training data are of high quality and focus on strengthening the fitting ability of the BBR loss. However, blindly strengthening BBR on low-quality examples jeopardizes localization performance. Focal-EIoU v1 was proposed to solve this problem, but due to its static focusing mechanism (FM), the potential of a non-monotonic FM was not fully exploited. Based on this idea, we propose an IoU-based loss with a dynamic non-monotonic FM named Wise-IoU (WIoU). When WIoU is applied to the state-of-the-art real-time detector YOLOv7, the AP-75 on the MS-COCO dataset improves from 53.03% to 54.50%.
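A minimal sketch of the idea follows. The gain function is an assumed illustration of the general non-monotonic shape (small gradient gain for both very easy and outlier examples, peaking for ordinary-quality anchors); the constants and the distance-attention term of the full WIoU are omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def focusing_gain(iou_loss, mean_iou_loss, alpha=1.9, delta=3.0):
    """Non-monotonic gradient gain: beta measures how much worse this
    anchor is than average (its outlier degree); both tiny and huge
    beta yield a small gain, down-weighting trivial and low-quality
    examples alike."""
    beta = iou_loss / max(mean_iou_loss, 1e-9)
    return beta / (delta * alpha ** (beta - delta))

def wiou_like_loss(pred, target, mean_iou_loss):
    l_iou = 1.0 - iou(pred, target)
    return focusing_gain(l_iou, mean_iou_loss) * l_iou
```

Because `mean_iou_loss` is a running statistic of the batch, the focusing is dynamic: the same anchor receives a different gain as training progresses, which is what distinguishes this scheme from a static FM.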
Abstract:The Terahertz (THz) band, with its abundant multi-ten-GHz bandwidth, can support Terabit-per-second wireless communications, a pillar technology for 6G and beyond systems. With sub-millimeter-long antennas, ultra-massive (UM) MIMO and intelligent surface (IS) systems with thousands of array elements are exploited to effectively combat the distance limitation and blockage problems, composing a promising THz ultra-large antenna array (ULAA) system. As a combined effect of wavelength and array aperture, the resulting coverage of THz systems ranges from near-field to far-field, leading to a new paradigm of cross-field communications. Although channel models, communication theories, and networking strategies have been studied for the far-field and near-field separately, a unified design of cross-field communications that achieves high spectral efficiency and low complexity is still missing. In this article, the challenges and features of THz ULAA cross-field communications are investigated. Furthermore, cross-field solutions from three perspectives are presented, including a hybrid spherical- and planar-wave channel model, cross-field channel estimation, and widely-spaced multi-subarray hybrid beamforming, where the subarray is exploited as the basic unit of THz ULAA systems. The channel modeling approximation error, spectral efficiency, and estimation error of these designs are numerically evaluated. Finally, as a roadmap for THz ULAA cross-field communications, multiple open problems and potential research directions are elaborated.
Abstract:Due to complicated backgrounds and noise in infrared images, infrared small target detection is one of the most difficult problems in computer vision. Most existing studies use semantic segmentation methods to achieve better results, computing the centroid of each target from the segmentation map as the detection result. In contrast, we propose a novel end-to-end framework for infrared small target detection and segmentation. First, using UNet as the backbone to maintain resolution and semantic information, our model achieves higher detection accuracy than other state-of-the-art methods by attaching a simple anchor-free head. Then, a pyramid pooling module is used to further extract features and improve the precision of target segmentation. Next, we use semantic segmentation tasks, which pay more attention to pixel-level features, to assist the training of object detection; this increases the average precision and allows the model to detect targets that were previously missed. Furthermore, we develop a multi-task framework for infrared small target detection and segmentation. Compared with the composite single-task model, our multi-task learning model reduces complexity by nearly half and nearly doubles inference speed while maintaining accuracy. The code and models are publicly available at https://github.com/Chenastron/MTUNet.
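The baseline "detection via segmentation" step that the paper contrasts with can be sketched in a few lines (a single-target binary mask is assumed for simplicity):

```python
import numpy as np

def centroid_from_mask(seg_mask):
    """Baseline pipeline: the centroid of the predicted foreground
    pixels in a binary segmentation mask is reported as the detected
    target location (row, col)."""
    rows, cols = np.nonzero(seg_mask)
    return rows.mean(), cols.mean()
```

The proposed framework instead attaches an anchor-free detection head directly, so target locations are predicted end-to-end rather than recovered post hoc from the mask.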
Abstract:Integrated ultra-massive multiple-input multiple-output (UM-MIMO) and intelligent reflecting surface (IRS) systems are promising for 6G and beyond Terahertz (0.1-10 THz) communications, as they effectively bypass the barriers of limited coverage and line-of-sight blockage. However, the excessive dimensions of UM-MIMO and IRS enlarge the near-field region, while strong THz channel sparsity in the far-field is detrimental to spatial multiplexing. Moreover, channel estimation (CE) requires recovering the large-scale channel from severely compressed observations due to the limited number of RF chains. To tackle these challenges, a hybrid spherical- and planar-wave channel model (HSPM) is developed for the cascaded channel of the integrated system. The spatial multiplexing gains in the near-field and far-field regions are analyzed and found to be limited by the segmented channel with the lower rank. Furthermore, a compressive-sensing-based CE framework is developed, including a sparse channel representation method, a separate-side estimation (SSE) algorithm, and a dictionary-shrinkage estimation (DSE) algorithm. Numerical results verify the effectiveness of the HSPM, whose capacity deviates by only $5\times10^{-4}$ bits/s/Hz from that of the ground-truth spherical-wave model with 256 elements. While the SSE achieves higher CE accuracy than benchmark algorithms, the DSE is more attractive in noisy environments, with a normalized mean square error 0.8 dB lower than that of the SSE.
Abstract:Millimeter-wave (mmWave) and Terahertz (THz)-band communications exploit abundant bandwidth to fulfill the increasing data rate demands of 6G wireless communications. To compensate for the high propagation loss with reduced hardware costs, ultra-massive multiple-input multiple-output (UM-MIMO) with a hybrid beamforming structure is a promising technology in the mmWave and THz bands. However, channel estimation (CE) is challenging for hybrid UM-MIMO systems, as it requires recovering high-dimensional channels from severely few channel observations. In this paper, a Pruned Approximate Message Passing (AMP) Integrated Deep Convolutional-neural-network (DCNN) CE (PRINCE) method is first proposed, which enhances the estimation accuracy of the AMP method by appending a DCNN. Moreover, by truncating the insignificant feature maps in the convolutional layers of the DCNN, a pruning method comprising training with regularization, pruning, and refining procedures is developed to reduce the network scale. Simulation results show that PRINCE achieves a good trade-off between CE accuracy and significantly low complexity, with a normalized mean square error (NMSE) of $-10$ dB at a signal-to-noise ratio (SNR) of $10$ dB after eliminating $80\%$ of the feature maps.
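The feature-map truncation can be illustrated with a hedged sketch: L1-norm filter ranking is assumed here as the significance measure, and the regularized training and refining stages of the full pruning procedure are omitted.

```python
import numpy as np

def prune_feature_maps(weights, keep_ratio=0.2):
    """Rank convolutional filters by the L1 norm of their weights and
    drop the weakest, i.e. truncate the insignificant feature maps
    they produce. weights: (out_channels, in_channels, kH, kW)."""
    norms = np.abs(weights).sum(axis=(1, 2, 3))   # one score per filter
    k = max(1, int(keep_ratio * weights.shape[0]))
    keep = np.sort(np.argsort(norms)[-k:])        # indices of strongest filters
    return weights[keep]
```

With `keep_ratio=0.2`, 80% of the filters (and hence their feature maps) are removed, matching the scale of reduction reported in the abstract, after which the shrunken network would be refined by further training.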