Abstract:Transformers have excelled in natural language processing and computer vision, paving their way to sensor-based Human Activity Recognition (HAR). Previous studies show that transformers outperform their counterparts exclusively when they harness abundant data or employ compute-intensive optimization algorithms. However, neither of these scenarios is viable in sensor-based HAR due to the scarcity of data in this field and the frequent need to perform training and inference on resource-constrained devices. Our extensive investigation into various implementations of transformer-based versus non-transformer-based HAR using wearable sensors, encompassing more than 500 experiments, corroborates these concerns. We observe that transformer-based solutions pose higher computational demands, consistently yield inferior performance, and experience significant performance degradation when quantized to accommodate resource-constrained devices. Additionally, transformers demonstrate lower robustness to adversarial attacks, posing a potential threat to user trust in HAR.
Abstract:Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model's understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.
Abstract:In this paper, we study the shortest path problem (SPP) with multiple source-destination pairs (MSD), namely MSD-SPP, to minimize average travel time of all shortest paths. The inherent traffic capacity limits within a road network contributes to the competition among vehicles. Multi-agent reinforcement learning (MARL) model cannot offer effective and efficient path planning cooperation due to the asynchronous decision making setting in MSD-SPP, where vehicles (a.k.a agents) cannot simultaneously complete routing actions in the previous time step. To tackle the efficiency issue, we propose to divide an entire road network into multiple sub-graphs and subsequently execute a two-stage process of inter-region and intra-region route planning. To address the asynchronous issue, in the proposed asyn-MARL framework, we first design a global state, which exploits a low-dimensional vector to implicitly represent the joint observations and actions of multi-agents. Then we develop a novel trajectory collection mechanism to decrease the redundancy in training trajectories. Additionally, we design a novel actor network to facilitate the cooperation among vehicles towards the same or close destinations and a reachability graph aimed at preventing infinite loops in routing paths. On both synthetic and real road networks, our evaluation result demonstrates that our approach outperforms state-of-the-art planning approaches.
Abstract:Under the framework of network-based neurodegeneration, brain functional connectome (FC)-based Graph Neural Networks (GNN) have emerged as a valuable tool for the diagnosis and prognosis of neurodegenerative diseases such as Alzheimer's disease (AD). However, these models are tailored for brain FC at a single time point instead of characterizing FC trajectory. Discerning how FC evolves with disease progression, particularly at the predementia stages such as cognitively normal individuals with amyloid deposition or individuals with mild cognitive impairment (MCI), is crucial for delineating disease spreading patterns and developing effective strategies to slow down or even halt disease advancement. In this work, we proposed the first interpretable framework for brain FC trajectory embedding with application to neurodegenerative disease diagnosis and prognosis, namely Brain Tokenized Graph Transformer (Brain TokenGT). It consists of two modules: 1) Graph Invariant and Variant Embedding (GIVE) for generation of node and spatio-temporal edge embeddings, which were tokenized for downstream processing; 2) Brain Informed Graph Transformer Readout (BIGTR) which augments previous tokens with trainable type identifiers and non-trainable node identifiers and feeds them into a standard transformer encoder to readout. We conducted extensive experiments on two public longitudinal fMRI datasets of the AD continuum for three tasks, including differentiating MCI from controls, predicting dementia conversion in MCI, and classification of amyloid positive or negative cognitively normal individuals. Based on brain FC trajectory, the proposed Brain TokenGT approach outperformed all the other benchmark models and at the same time provided excellent interpretability. The code is available at https://github.com/ZijianD/Brain-TokenGT.git
Abstract:This study aimed to solve the semantic gap and misalignment issue between encoding and decoding because of multiple convolutional and pooling operations in U-Net when segmenting subabdominal MRI images during rectal cancer treatment. A MRI Image Segmentation is proposed based on a multi-scale feature pyramid network and dual attention mechanism. Our innovation is the design of two modules: 1) a dilated convolution and multi-scale feature pyramid network are used in the encoding to avoid the semantic gap. 2) a dual attention mechanism is designed to maintain spatial information of U-Net and reduce misalignment. Experiments on a subabdominal MRI image dataset show the proposed method achieves better performance than others methods. In conclusion, a multi-scale feature pyramid network can reduce the semantic gap, and the dual attention mechanism can make an alignment of features between encoding and decoding.
Abstract:Personalized federated learning considers learning models unique to each client in a heterogeneous network. The resulting client-specific models have been purported to improve metrics such as accuracy, fairness, and robustness in federated networks. However, despite a plethora of work in this area, it remains unclear: (1) which personalization techniques are most effective in various settings, and (2) how important personalization truly is for realistic federated applications. To better answer these questions, we propose Motley, a benchmark for personalized federated learning. Motley consists of a suite of cross-device and cross-silo federated datasets from varied problem domains, as well as thorough evaluation metrics for better understanding the possible impacts of personalization. We establish baselines on the benchmark by comparing a number of representative personalized federated learning methods. These initial results highlight strengths and weaknesses of existing approaches, and raise several open questions for the community. Motley aims to provide a reproducible means with which to advance developments in personalized and heterogeneity-aware federated learning, as well as the related areas of transfer learning, meta-learning, and multi-task learning.
Abstract:Autonomous driving requires 3D maps that provide accurate and up-to-date information about semantic landmarks. Due to the wider availability and lower cost of cameras compared with laser scanners, vision-based mapping has attracted much attention from academia and industry. Among the existing solutions, Structure-from-Motion (SfM) technology has proved to be feasible for building 3D maps from crowdsourced data, since it allows unordered images as input. Previous works on SfM have mainly focused on issues related to building 3D point clouds and calculating camera poses, leaving the issues of automatic change detection and localization open. We propose in this paper an SfM-based solution for automatic map update, with a focus on real-time change detection and localization. Our solution builds on comparison of semantic map data (e.g. types and locations of traffic signs). Through a novel design of the pixel-wise 3D localization algorithm, our system can locate the objects detected from 2D images in a 3D space, utilizing sparse SfM point clouds. Experiments with dashcam videos collected from two urban areas prove that the system is able to locate visible traffic signs in front along the driving direction with a median distance error of 1.52 meters. Moreover, it can detect up to 80\% of the changes with a median distance error of 2.21 meters. The result analysis also shows the potential of significantly improving the system performance in the future by increasing the accuracy of the background technology in use, including in particularly the object detection and point cloud geo-registration algorithms.
Abstract:Vehicular fog computing (VFC) pushes the cloud computing capability to the distributed fog nodes at the edge of the Internet, enabling compute-intensive and latency-sensitive computing services for vehicles through task offloading. However, a heterogeneous mobility environment introduces uncertainties in terms of resource supply and demand, which are inevitable bottlenecks for the optimal offloading decision. Also, these uncertainties bring extra challenges to task offloading under the oblivious adversary attack and data privacy risks. In this article, we develop a new adversarial online algorithm with bandit feedback based on the adversarial multi-armed bandit theory, to enable scalable and low-complex offloading decision making on the fog node selection toward minimizing the offloading service cost in terms of delay and energy. The key is to implicitly tune exploration bonus in selection and assessment rules of the designed algorithm, taking into account volatile resource supply and demand. We theoretically prove that the input-size dependent selection rule allows to choose a suitable fog node without exploring the sub-optimal actions, and also an appropriate score patching rule allows to quickly adapt to evolving circumstances, which reduces variance and bias simultaneously, thereby achieving better exploitation exploration balance. Simulation results verify the effectiveness and robustness of the proposed algorithm.
Abstract:Accurate traffic speed prediction is an important and challenging topic for transportation planning. Previous studies on traffic speed prediction predominately used spatio-temporal and context features for prediction. However, they have not made good use of the impact of urban traffic incidents. In this work, we aim to make use of the information of urban incidents to achieve a better prediction of traffic speed. Our incident-driven prediction framework consists of three processes. First, we propose a critical incident discovery method to discover urban traffic incidents with high impact on traffic speed. Second, we design a binary classifier, which uses deep learning methods to extract the latent incident impact features from the middle layer of the classifier. Combining above methods, we propose a Deep Incident-Aware Graph Convolutional Network (DIGC-Net) to effectively incorporate urban traffic incident, spatio-temporal, periodic and context features for traffic speed prediction. We conduct experiments on two real-world urban traffic datasets of San Francisco and New York City. The results demonstrate the superior performance of our model compare to the competing benchmarks.