Abstract:Various works have been extensively studied in the research of text-to-image generation. Although existing models perform well in text-to-image generation, there are significant challenges when directly employing them to generate images in dialogs. In this paper, we first highlight a new problem: dialog-to-image generation, that is, given the dialog context, the model should generate a realistic image which is consistent with the specified conversation as response. To tackle the problem, we propose an efficient approach for dialog-to-image generation without any intermediate translation, which maximizes the extraction of the semantic information contained in the dialog. Considering the characteristics of dialog structure, we put segment token before each sentence in a turn of a dialog to differentiate different speakers. Then, we fine-tune pre-trained text-to-image models to enable them to generate images conditioning on processed dialog context. After fine-tuning, our approach can consistently improve the performance of various models across multiple metrics. Experimental results on public benchmark demonstrate the effectiveness and practicability of our method.
Abstract:Humans possess an extraordinary ability to create and utilize tools, allowing them to overcome physical limitations and explore new frontiers. With the advent of foundation models, AI systems have the potential to be equally adept in tool use as humans. This paradigm, i.e., tool learning with foundation models, combines the strengths of specialized tools and foundation models to achieve enhanced accuracy, efficiency, and automation in problem-solving. Despite its immense potential, there is still a lack of a comprehensive understanding of key challenges, opportunities, and future endeavors in this field. To this end, we present a systematic investigation of tool learning in this paper. We first introduce the background of tool learning, including its cognitive origins, the paradigm shift of foundation models, and the complementary roles of tools and models. Then we recapitulate existing tool learning research into tool-augmented and tool-oriented learning. We formulate a general tool learning framework: starting from understanding the user instruction, models should learn to decompose a complex task into several subtasks, dynamically adjust their plan through reasoning, and effectively conquer each sub-task by selecting appropriate tools. We also discuss how to train models for improved tool-use capabilities and facilitate the generalization in tool learning. Considering the lack of a systematic tool learning evaluation in prior works, we experiment with 17 representative tools and show the potential of current foundation models in skillfully utilizing tools. Finally, we discuss several open problems that require further investigation for tool learning. Overall, we hope this paper could inspire future research in integrating tools with foundation models.
Abstract:Salient object detection in optical remote sensing image (ORSI-SOD) has gradually attracted attention thanks to the development of deep learning (DL) and salient object detection in natural scene image (NSI-SOD). However, NSI and ORSI are different in many aspects, such as large coverage, complex background, and large differences in target types and scales. Therefore, a new dedicated method is needed for ORSI-SOD. In addition, existing methods do not pay sufficient attention to the boundary of the object, and the completeness of the final saliency map still needs improvement. To address these issues, we propose a novel method called Dual Feedback Attention Framework via Boundary-Aware Auxiliary and Progressive Semantic Optimization (DFA-BASO). First, Boundary Protection Calibration (BPC) module is proposed to reduce the loss of edge position information during forward propagation and suppress noise in low-level features. Second, a Dual Feature Feedback Complementary (DFFC) module is proposed based on BPC module. It aggregates boundary-semantic dual features and provides effective feedback to coordinate features across different layers. Finally, a Strong Semantic Feedback Refinement (SSFR) module is proposed to obtain more complete saliency maps. This module further refines feature representation and eliminates feature differences through a unique feedback mechanism. Extensive experiments on two public datasets show that DFA-BASO outperforms 15 state-of-the-art methods. Furthermore, this paper strongly demonstrates the true contribution of DFA-BASO to ORSI-SOD by in-depth analysis of the visualization figure. All codes can be found at https://github.com/YUHsss/DFA-BASO.
Abstract:Although many large-scale knowledge bases (KBs) claim to contain multilingual information, their support for many non-English languages is often incomplete. This incompleteness gives birth to the task of cross-lingual question answering over knowledge base (xKBQA), which aims to answer questions in languages different from that of the provided KB. One of the major challenges facing xKBQA is the high cost of data annotation, leading to limited resources available for further exploration. Another challenge is mapping KB schemas and natural language expressions in the questions under cross-lingual settings. In this paper, we propose a novel approach for xKBQA in a reading comprehension paradigm. We convert KB subgraphs into passages to narrow the gap between KB schemas and questions, which enables our model to benefit from recent advances in multilingual pre-trained language models (MPLMs) and cross-lingual machine reading comprehension (xMRC). Specifically, we use MPLMs, with considerable knowledge of cross-lingual mappings, for cross-lingual reading comprehension. Existing high-quality xMRC datasets can be further utilized to finetune our model, greatly alleviating the data scarcity issue in xKBQA. Extensive experiments on two xKBQA datasets in 12 languages show that our approach outperforms various baselines and achieves strong few-shot and zero-shot performance. Our dataset and code are released for further research.