Abstract:The predictions of click through rate (CTR) and conversion rate (CVR) play a crucial role in the success of ad-recommendation systems. A Deep Hierarchical Ensemble Network (DHEN) has been proposed to integrate multiple feature crossing modules and has achieved great success in CTR prediction. However, its performance for CVR prediction is unclear in the conversion ads setting, where an ad bids for the probability of a user's off-site actions on a third party website or app, including purchase, add to cart, sign up, etc. A few challenges in DHEN: 1) What feature-crossing modules (MLP, DCN, Transformer, to name a few) should be included in DHEN? 2) How deep and wide should DHEN be to achieve the best trade-off between efficiency and efficacy? 3) What hyper-parameters to choose in each feature-crossing module? Orthogonal to the model architecture, the input personalization features also significantly impact model performance with a high degree of freedom. In this paper, we attack this problem and present our contributions biased to the applied data science side, including: First, we propose a multitask learning framework with DHEN as the single backbone model architecture to predict all CVR tasks, with a detailed study on how to make DHEN work effectively in practice; Second, we build both on-site real-time user behavior sequences and off-site conversion event sequences for CVR prediction purposes, and conduct ablation study on its importance; Last but not least, we propose a self-supervised auxiliary loss to predict future actions in the input sequence, to help resolve the label sparseness issue in CVR prediction. Our method achieves state-of-the-art performance compared to previous single feature crossing modules with pre-trained user personalization features.
Abstract:Accurate predictions of spatio-temporal systems' states are crucial for tasks such as system management, control, and crisis prevention. However, the inherent time variance of spatio-temporal systems poses challenges to achieving accurate predictions whenever stationarity is not granted. To address non-stationarity frameworks, we propose a Distribution and Relation Adaptive Network (DRAN) capable of dynamically adapting to relation and distribution changes over time. While temporal normalization and de-normalization are frequently used techniques to adapt to distribution shifts, this operation is not suitable for the spatio-temporal context as temporal normalization scales the time series of nodes and possibly disrupts the spatial relations among nodes. In order to address this problem, we develop a Spatial Factor Learner (SFL) module that enables the normalization and de-normalization process in spatio-temporal systems. To adapt to dynamic changes in spatial relationships among sensors, we propose a Dynamic-Static Fusion Learner (DSFL) module that effectively integrates features learned from both dynamic and static relations through an adaptive fusion ratio mechanism. Furthermore, we introduce a Stochastic Learner to capture the noisy components of spatio-temporal representations. Our approach outperforms state of the art methods in weather prediction and traffic flows forecasting tasks. Experimental results show that our SFL efficiently preserves spatial relationships across various temporal normalization operations. Visualizations of the learned dynamic and static relations demonstrate that DSFL can capture both local and distant relationships between nodes. Moreover, ablation studies confirm the effectiveness of each component.
Abstract:Due to the challenge posed by multi-source and heterogeneous data collected from diverse environments, causal relationships among features can exhibit variations influenced by different time spans, regions, or strategies. This diversity makes a single causal model inadequate for accurately representing complex causal relationships in all observational data, a crucial consideration in causal learning. To address this challenge, we introduce the nonlinear Causal Kernel Clustering method designed for heterogeneous subgroup causal learning, illuminating variations in causal relationships across diverse subgroups. It comprises two primary components. First, the construction of a sample mapping function forms the basis of the subsequent nonlinear causal kernel. This function assesses the differences in potential nonlinear causal relationships in various samples, supported by our causal identifiability theory. Second, a nonlinear causal kernel is proposed for clustering heterogeneous subgroups. Experimental results showcase the exceptional performance of our method in accurately identifying heterogeneous subgroups and effectively enhancing causal learning, leading to a great reduction in prediction error.
Abstract:Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities like depth, infrared and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. Specifically, MADM comprises two key complementary components to tackle major challenges. First, due to the large modality gap, using one modal data to generate pseudo labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of latent low-resolution features in diffusion models, MADM introduces the label palette and latent regression which converts one-hot encoded labels into the RGB form by palette and regresses them in the latent space, thus ensuring the pre-trained decoder for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities. We open-source our code and models at https://github.com/XiaRho/MADM.
Abstract:Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models based on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occlusion scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. Specifically, the construction of our COCO-Matting includes accessory fusion and mask-to-matte, which selects real-world complex images from COCO and converts semantic segmentation masks to matting labels. The built COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and only train a lightweight matting decoder by end-to-end matting losses, which do not fully exploit the potential of the pre-trained SAM. Thus, we propose SEMat which revamps the network architecture and training objectives. For network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features. The proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For training objectives, the proposed regularization and trimap loss aim to retain the prior from the pre-trained model and push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method, proving its efficacy in interactive natural image matting. We open-source our code, models, and dataset at https://github.com/XiaRho/SEMat.
Abstract:With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive loss learning, with negative examples being a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. Specifically, since the model's ability to handle preprocessed negative examples evolves during training, we propose dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Secondly, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide more negative examples for embedding training and balance the batch size across multiple tasks. Moreover, we also discovered that the prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of Massive text embedding benchmark
Abstract:With the remarkable success of generative models like ChatGPT, Artificial Intelligence Generated Content (AIGC) is undergoing explosive development. Not limited to text and images, generative models can generate industrial time series data, addressing challenges such as the difficulty of data collection and data annotation. Due to their outstanding generation ability, they have been widely used in Internet of Things, metaverse, and cyber-physical-social systems to enhance the efficiency of industrial production. In this paper, we present a comprehensive overview of generative models for industrial time series from deep generative models (DGMs) to large generative models (LGMs). First, a DGM-based AIGC framework is proposed for industrial time series generation. Within this framework, we survey advanced industrial DGMs and present a multi-perspective categorization. Furthermore, we systematically analyze the critical technologies required to construct industrial LGMs from four aspects: large-scale industrial dataset, LGMs architecture for complex industrial characteristics, self-supervised training for industrial time series, and fine-tuning of industrial downstream tasks. Finally, we conclude the challenges and future directions to enable the development of generative models in industry.
Abstract:Computer vision tasks are crucial for aerospace missions as they help spacecraft to understand and interpret the space environment, such as estimating position and orientation, reconstructing 3D models, and recognizing objects, which have been extensively studied to successfully carry out the missions. However, traditional methods like Kalman Filtering, Structure from Motion, and Multi-View Stereo are not robust enough to handle harsh conditions, leading to unreliable results. In recent years, deep learning (DL)-based perception technologies have shown great potential and outperformed traditional methods, especially in terms of their robustness to changing environments. To further advance DL-based aerospace perception, various frameworks, datasets, and strategies have been proposed, indicating significant potential for future applications. In this survey, we aim to explore the promising techniques used in perception tasks and emphasize the importance of DL-based aerospace perception. We begin by providing an overview of aerospace perception, including classical space programs developed in recent years, commonly used sensors, and traditional perception methods. Subsequently, we delve into three fundamental perception tasks in aerospace missions: pose estimation, 3D reconstruction, and recognition, as they are basic and crucial for subsequent decision-making and control. Finally, we discuss the limitations and possibilities in current research and provide an outlook on future developments, including the challenges of working with limited datasets, the need for improved algorithms, and the potential benefits of multi-source information fusion.
Abstract:Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in the videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision of night images on trainable networks. In this paper, we propose a self-supervised nighttime monocular depth estimation method that does not use any night images during training. Our framework utilizes day images as a stable source for self-supervision and applies physical priors (e.g., wave optics, reflection model and read-shot noise model) to compensate for some key day-night differences. With day-to-night data distribution compensation, our framework can be trained in an efficient one-stage self-supervised manner. Though no nighttime images are considered during training, qualitative and quantitative results demonstrate that our method achieves SoTA depth estimating results on the challenging nuScenes-Night and RobotCar-Night compared with existing methods.
Abstract:Large-scale overlapping problems are prevalent in practical engineering applications, and the optimization challenge is significantly amplified due to the existence of shared variables. Decomposition-based cooperative coevolution (CC) algorithms have demonstrated promising performance in addressing large-scale overlapping problems. However, current CC frameworks designed for overlapping problems rely on grouping methods for the identification of overlapping problem structures and the current grouping methods for large-scale overlapping problems fail to consider both accuracy and efficiency simultaneously. In this article, we propose a two-stage enhanced grouping method for large-scale overlapping problems, called OEDG, which achieves accurate grouping while significantly reducing computational resource consumption. In the first stage, OEDG employs a grouping method based on the finite differences principle to identify all subcomponents and shared variables. In the second stage, we propose two grouping refinement methods, called subcomponent union detection (SUD) and subcomponent detection (SD), to enhance and refine the grouping results. SUD examines the information of the subcomponents and shared variables obtained in the previous stage, and SD corrects inaccurate grouping results. To better verify the performance of the proposed OEDG, we propose a series of novel benchmarks that consider various properties of large-scale overlapping problems, including the topology structure, overlapping degree, and separability. Extensive experimental results demonstrate that OEDG is capable of accurately grouping different types of large-scale overlapping problems while consuming fewer computational resources. Finally, we empirically verify that the proposed OEDG can effectively improve the optimization performance of diverse large-scale overlapping problems.