Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xinyu Wei

Are Large Language Models Good In-context Learners for Financial Sentiment Analysis?

Mar 06, 2025

Xinyu Wei, Luojia Liu

Abstract:Recently, large language models (LLMs) with hundreds of billions of parameters have demonstrated the emergent ability, surpassing traditional methods in various domains even without fine-tuning over domain-specific data. However, when it comes to financial sentiment analysis (FSA)$\unicode{x2013}$a fundamental task in financial AI$\unicode{x2013}$these models often encounter various challenges, such as complex financial terminology, subjective human emotions, and ambiguous inclination expressions. In this paper, we aim to answer the fundamental question: whether LLMs are good in-context learners for FSA? Unveiling this question can yield informative insights on whether LLMs can learn to address the challenges by generalizing in-context demonstrations of financial document-sentiment pairs to the sentiment analysis of new documents, given that finetuning these models on finance-specific data is difficult, if not impossible at all. To the best of our knowledge, this is the first paper exploring in-context learning for FSA that covers most modern LLMs (recently released DeepSeek V3 included) and multiple in-context sample selection methods. Comprehensive experiments validate the in-context learning capability of LLMs for FSA.

* Accepted by ICLR 2025 Workshop on Advances in Financial AI

Via

Access Paper or Ask Questions

MAVIS: Mathematical Visual Instruction Tuning

Jul 11, 2024

Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang(+2 more)

Figure 1 for MAVIS: Mathematical Visual Instruction Tuning

Figure 2 for MAVIS: Mathematical Visual Instruction Tuning

Figure 3 for MAVIS: Mathematical Visual Instruction Tuning

Figure 4 for MAVIS: Mathematical Visual Instruction Tuning

Abstract:Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This draws forth an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align the CLIP-Math with a large language model (LLM) by a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, including 900K meticulously collected and annotated visual math problems, which is adopted to finally instruct-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem, and minimize textual redundancy, thereby concentrating the model towards the visual elements. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS

* Work in progress. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS

Via

Access Paper or Ask Questions

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Jun 22, 2024

Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

Figure 1 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Figure 2 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Figure 3 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Figure 4 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Abstract:In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal comprehension and vision perception synergistically. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs, like object detection bounding boxes, to capture subtle visual elements, thus enriching the understanding of both visual and textual data. In addition, an innovative perception-embedded prompt generation mechanism is proposed to embed perceptual information into the language model's prompts, aligning the responses contextually and perceptually for a more accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM's superior performance in various multimodal comprehension and vision perception tasks, particularly those requiring corner case vision perception and fine-grained language comprehension.

* 14 pages, 8 figures

Via

Access Paper or Ask Questions

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Apr 01, 2024

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li

Figure 1 for Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Figure 2 for Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Figure 3 for Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Figure 4 for Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Abstract:The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM for various visual prompts (points, bounding boxes, and free-form shape) and language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data features a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, including natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities.

* 16 pages, 7 figures

Via

Access Paper or Ask Questions

IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

Mar 21, 2024

Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, Ziyong Feng

Abstract:Leveraging Stable Diffusion for the generation of personalized portraits has emerged as a powerful and noteworthy tool, enabling users to create high-fidelity, custom character avatars based on their specific prompts. However, existing personalization methods face challenges, including test-time fine-tuning, the requirement of multiple input images, low preservation of identity, and limited diversity in generated outcomes. To overcome these challenges, we introduce IDAdapter, a tuning-free approach that enhances the diversity and identity preservation in personalized image generation from a single face image. IDAdapter integrates a personalized concept into the generation process through a combination of textual and visual injections and a face identity loss. During the training phase, we incorporate mixed features from multiple reference images of a specific identity to enrich identity-related content details, guiding the model to generate images with more diverse styles, expressions, and angles compared to previous works. Extensive evaluations demonstrate the effectiveness of our method, achieving both diversity and identity fidelity in generated images.

* 14 pages, 15 figures

Via

Access Paper or Ask Questions

Cloud-Device Collaborative Learning for Multimodal Large Language Models

Dec 26, 2023

Guanqun Wang, Jiaming Liu, Chenxuan Li, Junpeng Ma, Yuan Zhang, Xinyu Wei, Kevin Zhang, Maurice Chong, Ray Zhang, Yijiang Liu(+1 more)

Figure 1 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Figure 2 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Figure 3 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Figure 4 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Abstract:The burgeoning field of Multimodal Large Language Models (MLLMs) has exhibited remarkable performance in diverse tasks such as captioning, commonsense reasoning, and visual scene understanding. However, the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters, leading to a notable decline in generalization capabilities when these models are compressed for device deployment. Addressing this challenge, we introduce a Cloud-Device Collaborative Continual Adaptation framework, designed to enhance the performance of compressed, device-deployed MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment. In the uplink phase, we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively filter out-of-distribution tokens, thereby reducing transmission costs and improving training efficiency. On the cloud side, we propose Adapter-based Knowledge Distillation (AKD) method to transfer refined knowledge from large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic Weight update Compression (DWC) strategy for the downlink, which adaptively selects and quantizes updated weight parameters, enhancing transmission efficiency and reducing the representational disparity between cloud and device models. Extensive experiments on several multimodal benchmarks demonstrate the superiority of our proposed framework over prior Knowledge Distillation and device-cloud collaboration methods. Notably, we also validate the feasibility of our approach to real-world experiments.

Via

Access Paper or Ask Questions

Accretionary Learning with Deep Neural Networks

Nov 21, 2021

Xinyu Wei, Biing-Hwang Fred Juang, Ouya Wang, Shenglong Zhou, Geoffrey Ye Li

Figure 1 for Accretionary Learning with Deep Neural Networks

Figure 2 for Accretionary Learning with Deep Neural Networks

Figure 3 for Accretionary Learning with Deep Neural Networks

Figure 4 for Accretionary Learning with Deep Neural Networks

Abstract:One of the fundamental limitations of Deep Neural Networks (DNN) is its inability to acquire and accumulate new cognitive capabilities. When some new data appears, such as new object classes that are not in the prescribed set of objects being recognized, a conventional DNN would not be able to recognize them due to the fundamental formulation that it takes. The current solution is typically to re-design and re-learn the entire network, perhaps with a new configuration, from a newly expanded dataset to accommodate new knowledge. This process is quite different from that of a human learner. In this paper, we propose a new learning method named Accretionary Learning (AL) to emulate human learning, in that the set of objects to be recognized may not be pre-specified. The corresponding learning structure is modularized, which can dynamically expand to register and use new knowledge. During accretionary learning, the learning process does not require the system to be totally re-designed and re-trained as the set of objects grows in size. The proposed DNN structure does not forget previous knowledge when learning to recognize new data classes. We show that the new structure and the design methodology lead to a system that can grow to cope with increased cognitive complexity while providing stable and superior overall performance.

Via

Access Paper or Ask Questions

Spatiotemporal Attention Networks for Wind Power Forecasting

Sep 14, 2019

Xingbo Fu, Feng Gao, Jiang Wu, Xinyu Wei, Fangwei Duan

Figure 1 for Spatiotemporal Attention Networks for Wind Power Forecasting

Figure 2 for Spatiotemporal Attention Networks for Wind Power Forecasting

Figure 3 for Spatiotemporal Attention Networks for Wind Power Forecasting

Figure 4 for Spatiotemporal Attention Networks for Wind Power Forecasting

Abstract:Wind power is one of the most important renewable energy sources and accurate wind power forecasting is very significant for reliable and economic power system operation and control strategies. This paper proposes a novel framework with spatiotemporal attention networks (STAN) for wind power forecasting. This model captures spatial correlations among wind farms and temporal dependencies of wind power time series. First of all, we employ a multi-head self-attention mechanism to extract spatial correlations among wind farms. Then, temporal dependencies are captured by the Sequence-to-Sequence (Seq2Seq) model with a global attention mechanism. Finally, experimental results demonstrate that our model achieves better performance than other baseline approaches. Our work provides useful insights to capture non-Euclidean spatial correlations.

Via

Access Paper or Ask Questions