Abstract:Transformer-based methods have recently achieved significant success in 3D human pose estimation, owing to their strong ability to model long-range dependencies. However, relying solely on the global attention mechanism is insufficient for capturing fine-grained local details, which are crucial for accurate pose estimation. To address this, we propose SSR-STF, a dual-stream model that effectively integrates local features with global dependencies to enhance 3D human pose estimation. Specifically, we introduce SSRFormer, a simple yet effective module that employs the skeleton selective refine attention (SSRA) mechanism to capture fine-grained local dependencies in human pose sequences, complementing the global dependencies modeled by the Transformer. By adaptively fusing these two feature streams, SSR-STF can better learn the underlying structure of human poses, overcoming the limitations of traditional methods in local feature extraction. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that SSR-STF achieves state-of-the-art performance, with P1 errors of 37.4 mm and 13.2 mm, respectively, outperforming existing methods in both accuracy and generalization. Furthermore, the motion representations learned by our model prove effective in downstream tasks such as human mesh recovery. Code is available at https://github.com/poker-xu/SSR-STF.
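The error figures quoted above can be made concrete with a small sketch. It assumes that "P1" refers to the mean per-joint position error (MPJPE), as is common for Protocol 1 on Human3.6M; the abstract itself does not define the metric, and the function name `mpjpe` and the toy poses are hypothetical.

```python
from math import sqrt

def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and ground-truth joints.

    Both arguments are lists of (x, y, z) tuples, one per joint, in the same
    unit (e.g. millimetres). Assumption: this is what "P1 error" measures.
    """
    if len(pred) != len(target):
        raise ValueError("poses must have the same number of joints")
    dists = [
        sqrt(sum((p - t) ** 2 for p, t in zip(pj, tj)))
        for pj, tj in zip(pred, target)
    ]
    return sum(dists) / len(dists)

# Toy two-joint pose: the first joint is exact, the second is off by 5 units.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, target)
```

In practice the value is averaged over all frames and test sequences; the sketch covers a single pose only.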
Abstract:Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes. Effective detection relies on capturing local contextual information at the appropriate scale. However, small-kernel CNNs have limited receptive fields, leading to false alarms, while transformer models, with global receptive fields, often treat small targets as noise, resulting in missed detections. Hybrid models struggle to bridge the semantic gap between CNNs and transformers, causing high complexity. To address these challenges, we propose LCRNet, a novel method that learns dynamic local context representations for ISTD. The model consists of three components: (1) C2FBlock, inspired by PDE solvers, for efficient small target information capture; (2) DLC-Attention, a large-kernel attention mechanism that dynamically builds context and reduces feature redundancy; and (3) HLKConv, a hierarchical convolution operator based on large-kernel decomposition that preserves sparsity and mitigates the drawbacks of dilated convolutions. Despite its simplicity, with only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) performance. Experiments on multiple datasets, comparing LCRNet with 33 SOTA methods, demonstrate its superior performance and efficiency.
Abstract:Classical demand modeling analyzes travel behavior using only low-dimensional numeric data (i.e., sociodemographics and travel attributes) but not high-dimensional urban imagery. However, travel behavior depends on the factors represented by both numeric data and urban imagery, thus necessitating a synergetic framework to combine them. This study creates a theoretical framework of deep hybrid models with a crossing structure consisting of a mixing operator and a behavioral predictor, thus integrating the numeric and imagery data into a latent space. Empirically, this framework is applied to analyze travel mode choice using the MyDailyTravel Survey from Chicago as the numeric inputs and satellite images as the imagery inputs. We found that, with our supervision-as-mixing design, deep hybrid models outperform both traditional demand models and recent deep learning models in predicting aggregate and disaggregate travel behavior. The latent space in deep hybrid models can be interpreted, because it reveals meaningful spatial and social patterns. The deep hybrid models can also generate new urban images that do not exist in reality and interpret them with economic theory, such as computing substitution patterns and social welfare changes. Overall, the deep hybrid models demonstrate the complementarity between low-dimensional numeric and high-dimensional imagery data, and between traditional demand modeling and recent deep learning. The framework generalizes the latent classes and variables in classical hybrid demand models to a latent space, and leverages the computational power of deep learning for imagery while retaining economic interpretability grounded in microeconomic foundations.
Abstract:As an important task for the management of bike sharing systems, accurate forecasting of travel demand can facilitate the dispatch and relocation of bicycles to improve user satisfaction. In recent years, many deep learning algorithms have been introduced to improve bicycle usage forecasts. A typical practice is to integrate convolutional (CNN) and recurrent neural networks (RNN) to capture spatial-temporal dependency in historical travel demand. In a typical CNN, the convolution operation is conducted through a kernel that moves across a "matrix-format" city to extract features over spatially adjacent urban areas. This practice assumes that areas close to each other provide useful information that improves prediction accuracy. However, bicycle usage in neighboring areas might not always be similar, given spatial variations in built environment characteristics and travel behavior that affect cycling activities. Conversely, areas that are far apart can be more similar in their temporal usage patterns. To utilize the hidden linkage among these distant urban areas, the study proposes an irregular convolutional Long Short-Term Memory model (IrConv+LSTM) to improve short-term bike sharing demand forecasts. The model modifies the traditional CNN with an irregular convolutional architecture to extract dependency among "semantic neighbors". The proposed model is evaluated against a set of benchmark models in five study sites, which include one dockless bike sharing system in Singapore, and four station-based systems in Chicago, Washington, D.C., New York, and London. We find that IrConv+LSTM outperforms the benchmark models in the five cities. The model also achieves superior performance in areas with varying levels of bicycle usage and during peak periods. The findings suggest that "thinking beyond spatial neighbors" can further improve short-term travel demand prediction of urban bike sharing systems.
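The notion of "semantic neighbors" can be illustrated with a minimal sketch. This is not the authors' IrConv+LSTM: it assumes that similarity between areas is measured by the Pearson correlation of their historical demand series, which the abstract does not specify, and the names `pearson`, `semantic_neighbors`, and the parameter `k` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length demand series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no usage pattern to compare
    return cov / (sx * sy)

def semantic_neighbors(demand, k):
    """For each area, list the k other areas with the most similar usage pattern.

    `demand` is a list of per-area time series. Similarity is an assumed
    stand-in (Pearson correlation); spatial distance is deliberately ignored.
    """
    neighbors = {}
    for i, series in enumerate(demand):
        scores = [
            (pearson(series, other), j)
            for j, other in enumerate(demand)
            if j != i
        ]
        # Highest correlation first; ties broken by area index.
        scores.sort(key=lambda s: (-s[0], s[1]))
        neighbors[i] = [j for _, j in scores[:k]]
    return neighbors

# Toy demand for four areas. Areas 0 and 3 are not adjacent in this list
# but share the same peak pattern; area 2 mirrors area 0.
demand = [
    [1, 5, 2, 6, 1],
    [3, 3, 4, 3, 3],
    [6, 2, 5, 1, 6],
    [2, 9, 3, 11, 2],
]
result = semantic_neighbors(demand, k=1)
```

A convolution over such a neighbor set would aggregate features from the selected areas instead of from the fixed grid window around each cell.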
Abstract:Temporal point processes are widely used for sequential data modeling. In this paper, we focus on the problem of modeling sequential event propagation on a graph, such as retweeting by social network users or news transmission between websites. Given a collection of event propagation sequences, conventional point process models consider only the event history, i.e., they embed the event history into a vector, and ignore the latent graph structure. We propose a Graph Biased Temporal Point Process (GBTPP) that leverages structural information from graph representation learning, where the direct influence between nodes and the indirect influence from the event history are modeled separately. Moreover, the learned node embedding vector is also integrated into the embedded event history as side information. Experiments on a synthetic dataset and two real-world datasets show the efficacy of our model compared to conventional and state-of-the-art methods.
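For readers unfamiliar with temporal point processes, the following sketch shows how a conventional model conditions on event history alone. It uses a classical Hawkes process with an exponential kernel as a stand-in; it is not GBTPP, it contains no graph information, and the function names and the parameters `mu`, `alpha`, `beta` are illustrative assumptions.

```python
from math import exp, log

def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity at time t given strictly earlier events:
    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))."""
    return mu + alpha * sum(exp(-beta * (t - ti)) for ti in history if ti < t)

def hawkes_log_likelihood(events, T, mu, alpha, beta):
    """Log-likelihood of event times observed on [0, T]:
    sum_i log lambda(t_i) - integral_0^T lambda(s) ds."""
    log_term = sum(
        log(hawkes_intensity(t, events, mu, alpha, beta)) for t in events
    )
    # Closed-form integral of the intensity for the exponential kernel.
    compensator = mu * T + (alpha / beta) * sum(
        1.0 - exp(-beta * (T - ti)) for ti in events
    )
    return log_term - compensator

events = [1.0, 2.0, 3.0]
# With alpha = 0 the model reduces to a homogeneous Poisson process.
ll_poisson = hawkes_log_likelihood(events, T=4.0, mu=0.5, alpha=0.0, beta=1.0)
# With alpha > 0 each past event temporarily raises the intensity.
lam = hawkes_intensity(2.0, [1.0], mu=0.5, alpha=0.8, beta=1.0)
```

The intensity here depends only on past event times, which is the limitation the abstract attributes to conventional models.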
Abstract:Modern power grids are experiencing grand challenges caused by the stochastic and dynamic nature of growing renewable energy and demand response. Traditional theoretical assumptions and operational rules may be violated, and existing control systems struggle to adapt because they lack the computational power and accurate grid models needed for real-time use, raising growing concerns about the secure and economic operation of the power grid. Existing operational control actions are typically determined offline and are therefore suboptimal. This paper presents a novel paradigm, Grid Mind, for autonomous grid operational controls using deep reinforcement learning. The proposed AI agent for voltage control can learn its control policy through interactions with massive offline simulations, and adapt its behavior to new changes, including not only load/generation variations but also topological changes. A properly trained agent is tested on the IEEE 14-bus system with tens of thousands of scenarios, and promising performance is demonstrated in applying autonomous voltage controls for secure grid operation.