Abstract:In recent years, origin-destination (OD) demand prediction has gained significant attention for its profound implications in urban development. Existing data-driven deep learning methods primarily focus on the spatial or temporal dependency between regions yet neglecting regions' fundamental functional difference. Though knowledge-driven physical methods have characterised regions' functions by their radiation and attraction capacities, these functions are defined on numerical factors like population without considering regions' intrinsic nominal attributes, e.g., a region is a residential or industrial district. Moreover, the complicated relationships between two types of capacities, e.g., the radiation capacity of a residential district in the morning will be transformed into the attraction capacity in the evening, are totally missing from physical methods. In this paper, we not only generalize the physical radiation and attraction capacities into the deep learning framework with the extended capability to fulfil regions' functions, but also present a new model that captures the relationships between two types of capacities. Specifically, we first model regions' radiation and attraction capacities using a bilateral branch network, each equipped with regions' attribute representations. We then describe the transformation relationship of different capacities of the same region using a hypergraph-based parameter generation method. We finally unveil the competition relationship of different regions with the same attraction capacity through cluster-based adversarial learning. Extensive experiments on two datasets demonstrate the consistent improvements of our method over the state-of-the-art baselines, as well as the good explainability of regions' functions using their nominal attributes.
Abstract:Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
Abstract:In the remote sensing community, multimodal change detection (MCD) is particularly critical due to its ability to track changes across different imaging conditions and sensor types, making it highly applicable to a wide range of real-world scenarios. This paper focuses on addressing the challenges of MCD, especially the difficulty in comparing images from different sensors with varying styles and statistical characteristics of geospatial objects. Traditional MCD methods often struggle with these variations, leading to inaccurate and unreliable results. To overcome these limitations, a novel unsupervised cross-domain separable translation network (CSTN) is proposed, which uniquely integrates a within-domain self-reconstruction and a cross-domain image translation and cycle-reconstruction workflow with change detection constraints. The model is optimized by implementing both the tasks of image translation and MCD simultaneously, thereby guaranteeing the comparability of learned features from multimodal images. Specifically, a simple yet efficient dual-branch convolutional architecture is employed to separate the content and style information of multimodal images. This process generates a style-independent content-comparable feature space, which is crucial for achieving accurate change detection even in the presence of significant sensor variations. Extensive experimental results demonstrate the effectiveness of the proposed method, showing remarkable improvements over state-of-the-art approaches in terms of accuracy and efficacy for MCD. The implementation of our method will be publicly available at \url{https://github.com/OMEGA-RS/CSTN}
Abstract:This work introduces Weaver, our first family of large language models (LLMs) dedicated to content creation. Weaver is pre-trained on a carefully selected corpus that focuses on improving the writing capabilities of large language models. We then fine-tune Weaver for creative and professional writing purposes and align it to the preference of professional writers using a suit of novel methods for instruction data synthesis and LLM alignment, making it able to produce more human-like texts and follow more diverse instructions for content creation. The Weaver family consists of models of Weaver Mini (1.8B), Weaver Base (6B), Weaver Pro (14B), and Weaver Ultra (34B) sizes, suitable for different applications and can be dynamically dispatched by a routing agent according to query complexity to balance response quality and computation cost. Evaluation on a carefully curated benchmark for assessing the writing capabilities of LLMs shows Weaver models of all sizes outperform generalist LLMs several times larger than them. Notably, our most-capable Weaver Ultra model surpasses GPT-4, a state-of-the-art generalist LLM, on various writing scenarios, demonstrating the advantage of training specialized LLMs for writing purposes. Moreover, Weaver natively supports retrieval-augmented generation (RAG) and function calling (tool usage). We present various use cases of these abilities for improving AI-assisted writing systems, including integration of external knowledge bases, tools, or APIs, and providing personalized writing assistance. Furthermore, we discuss and summarize a guideline and best practices for pre-training and fine-tuning domain-specific LLMs.
Abstract:Previous super-resolution reconstruction (SR) works are always designed on the assumption that the degradation operation is fixed, such as bicubic downsampling. However, as for remote sensing images, some unexpected factors can cause the blurred visual performance, like weather factors, orbit altitude, etc. Blind SR methods are proposed to deal with various degradations. There are two main challenges of blind SR in RSIs: 1) the accu-rate estimation of degradation kernels; 2) the realistic image generation in the ill-posed problem. To rise to the challenge, we propose a novel blind SR framework based on dual conditional denoising diffusion probabilistic models (DDSR). In our work, we introduce conditional denoising diffusion probabilistic models (DDPM) from two aspects: kernel estimation progress and re-construction progress, named as the dual-diffusion. As for kernel estimation progress, conditioned on low-resolution (LR) images, a new DDPM-based kernel predictor is constructed by studying the invertible mapping between the kernel distribution and the latent distribution. As for reconstruction progress, regarding the predicted degradation kernels and LR images as conditional information, we construct a DDPM-based reconstructor to learning the mapping from the LR images to HR images. Com-prehensive experiments show the priority of our proposal com-pared with SOTA blind SR methods. Source Code is available at https://github.com/Lincoln20030413/DDSR
Abstract:This paper described the PCG-AIID system for L3DAS22 challenge in Task 1: 3D speech enhancement in office reverberant environment. We proposed a two-stage framework to address multi-channel speech denoising and dereverberation. In the first stage, a multiple input and multiple output (MIMO) network is applied to remove background noise while maintaining the spatial characteristics of multi-channel signals. In the second stage, a multiple input and single output (MISO) network is applied to enhance the speech from desired direction and post-filtering. As a result, our system ranked 3rd place in ICASSP2022 L3DAS22 challenge and significantly outperforms the baseline system, while achieving 3.2% WER and 0.972 STOI on the blind test-set.
Abstract:The generalization error bound of support vector machine (SVM) depends on the ratio of radius and margin, while standard SVM only considers the maximization of the margin but ignores the minimization of the radius. Several approaches have been proposed to integrate radius and margin for joint learning of feature transformation and SVM classifier. However, most of them either require the form of the transformation matrix to be diagonal, or are non-convex and computationally expensive. In this paper, we suggest a novel approximation for the radius of minimum enclosing ball (MEB) in feature space, and then propose a convex radius-margin based SVM model for joint learning of feature transformation and SVM classifier, i.e., F-SVM. An alternating minimization method is adopted to solve the F-SVM model, where the feature transformation is updatedvia gradient descent and the classifier is updated by employing the existing SVM solver. By incorporating with kernel principal component analysis, F-SVM is further extended for joint learning of nonlinear transformation and classifier. Experimental results on the UCI machine learning datasets and the LFW face datasets show that F-SVM outperforms the standard SVM and the existing radius-margin based SVMs, e.g., RMM, R-SVM+ and R-SVM+{\mu}.