Abstract:Structural health monitoring (SHM) is essential for the early detection of infrastructure defects, such as cracks in concrete bridge pier. but often faces challenges in efficiency and accuracy in complex environments. Although the Segment Anything Model (SAM) achieves excellent segmentation performance, its computational demands limit its suitability for real-time applications on edge devices. To address these challenges, this paper proposes Crack-EdgeSAM, a self-prompting crack segmentation system that integrates YOLOv8 for generating prompt boxes and a fine-tuned EdgeSAM model for crack segmentation. To ensure computational efficiency, the method employs ConvLoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique, along with DiceFocalLoss to fine-tune the EdgeSAM model. Our experimental results on public datasets and the climbing robot automatic inspections demonstrate that the system achieves high segmentation accuracy and significantly enhanced inference speed compared to the most recent methods. Notably, the system processes 1024 x 1024 pixels images at 46 FPS on our PC and 8 FPS on Jetson Orin Nano.
Abstract:Face manipulation detection has been receiving a lot of attention for the reliability and security of the face images/videos. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which are shown to be promising. As one of the important face features, the face depth map, which has shown to be effective in other areas such as face recognition or face detection, is unfortunately paid little attention to in literature for face manipulation detection. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information for robust face manipulation detection. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomaly created due to manipulation. The estimated face depth map is then considered as auxiliary information to be integrated with the backbone features using a Multi-head Depth Attention (MDA) mechanism that is newly designed. We also propose an RGB-Depth Inconsistency Attention (RDIA) module to effectively capture the inter-frame inconsistency for multi-frame input. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.
Abstract:This paper investigates the covert communication in an air-to-ground (A2G) system, where a UAV (Alice) can adopt the omnidirectional microwave (OM) or directional mmWave (DM) transmission mode to transmit covert data to a ground user (Bob) while suffering from the detection of an adversary (Willie). For both the OM and DM modes, we first conduct theoretical analysis to reveal the inherent relationship between the transmit rate/transmit power and basic covert performance metrics in terms of detection error probability (DEP), effective covert rate (ECR), and covert Shannon capacity (CSC). To facilitate the transmission mode selection at Alice, we then explore the optimization of transmit rate and transmit power for ECR/CSC maximization under the OM and DM modes, and further propose a hybrid OM/DM transmission mode which allows the UAV to adaptively select between the OM and DM modes to achieve the maximum ECR and CSC at a given location of UAV. Finally, extensive numerical results are provided to illustrate the covert performances of the concerned A2G system under different transmission modes, and demonstrate that the hybrid OM/DM transmission mode outperforms the pure OM or DM mode in terms of covert performance.
Abstract:Radon transform is widely used in physical and life sciences and one of its major applications is the X-ray computed tomography (X-ray CT), which is significant in modern health examination. The Radon inversion or image reconstruction is challenging due to the potentially defective radon projections. Conventionally, the reconstruction process contains several ad hoc stages to approximate the corresponding Radon inversion. Each of the stages is highly dependent on the results of the previous stage. In this paper, we propose a novel unified framework for Radon inversion via deep learning (DL). The Radon inversion can be approximated by the proposed framework with an end-to-end fashion instead of processing step-by-step with multiple stages. For simplicity, the proposed framework is short as iRadonMap (inverse Radon transform approximation). Specifically, we implement the iRadonMap as an appropriative neural network, of which the architecture can be divided into two segments. In the first segment, a learnable fully-connected filtering layer is used to filter the radon projections along the view-angle direction, which is followed by a learnable sinusoidal back-projection layer to transfer the filtered radon projections into an image. The second segment is a common neural network architecture to further improve the reconstruction performance in the image domain. The iRadonMap is overall optimized by training a large number of generic images from ImageNet database. To evaluate the performance of the iRadonMap, clinical patient data is used. Qualitative results show promising reconstruction performance of the iRadonMap.
Abstract:This paper addresses the problem of predicting popularity of comments in an online discussion forum using reinforcement learning, particularly addressing two challenges that arise from having natural language state and action spaces. First, the state representation, which characterizes the history of comments tracked in a discussion at a particular point, is augmented to incorporate the global context represented by discussions on world events available in an external knowledge source. Second, a two-stage Q-learning framework is introduced, making it feasible to search the combinatorial action space while also accounting for redundancy among sub-actions. We experiment with five Reddit communities, showing that the two methods improve over previous reported results on this task.
Abstract:We introduce an online popularity prediction and tracking task as a benchmark task for reinforcement learning with a combinatorial, natural language action space. A specified number of discussion threads predicted to be popular are recommended, chosen from a fixed window of recent comments to track. Novel deep reinforcement learning architectures are studied for effective modeling of the value function associated with actions comprised of interdependent sub-actions. The proposed model, which represents dependence between sub-actions through a bi-directional LSTM, gives the best performance across different experimental configurations and domains, and it also generalizes well with varying numbers of recommendation requests.
Abstract:This paper introduces a novel architecture for reinforcement learning with deep neural networks designed to handle state and action spaces characterized by natural language, as found in text-based games. Termed a deep reinforcement relevance network (DRRN), the architecture represents action and state spaces with separate embedding vectors, which are combined with an interaction function to approximate the Q-function in reinforcement learning. We evaluate the DRRN on two popular text games, showing superior performance over other deep Q-learning architectures. Experiments with paraphrased action descriptions show that the model is extracting meaning rather than simply memorizing strings of text.
Abstract:Successful applications of reinforcement learning in real-world problems often require dealing with partially observable states. It is in general very challenging to construct and infer hidden states as they often depend on the agent's entire interaction history and may require substantial domain knowledge. In this work, we investigate a deep-learning approach to learning the representation of states in partially observable tasks, with minimal prior knowledge of the domain. In particular, we propose a new family of hybrid models that combines the strength of both supervised learning (SL) and reinforcement learning (RL), trained in a joint fashion: The SL component can be a recurrent neural networks (RNN) or its long short-term memory (LSTM) version, which is equipped with the desired property of being able to capture long-term dependency on history, thus providing an effective way of learning the representation of hidden states. The RL component is a deep Q-network (DQN) that learns to optimize the control for maximizing long-term rewards. Extensive experiments in a direct mailing campaign problem demonstrate the effectiveness and advantages of the proposed approach, which performs the best among a set of previous state-of-the-art methods.
Abstract:We develop a fully discriminative learning approach for supervised Latent Dirichlet Allocation (LDA) model using Back Propagation (i.e., BP-sLDA), which maximizes the posterior probability of the prediction variable given the input document. Different from traditional variational learning or Gibbs sampling approaches, the proposed learning method applies (i) the mirror descent algorithm for maximum a posterior inference and (ii) back propagation over a deep architecture together with stochastic gradient/mirror descent for model parameter estimation, leading to scalable and end-to-end discriminative learning of the model. As a byproduct, we also apply this technique to develop a new learning method for the traditional unsupervised LDA model (i.e., BP-LDA). Experimental results on three real-world regression and classification tasks show that the proposed methods significantly outperform the previous supervised topic models, neural networks, and is on par with deep neural networks.