Pingjian Zhang

PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering

Jun 11, 2025

Xiujun Zhou, Pingjian Zhang, Deyou Tang

Abstract: Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models' generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP of 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions of 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.

* 13 pages, 7 figures, 5 tables
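
The abstract above reports F1, Hits@1, and Accuracy without defining them. The following is a minimal sketch of how these metrics are commonly computed per question in KGQA evaluation, assuming answers are sets of entity identifiers; the function names and the empty-set convention are illustrative assumptions, not the authors' evaluation script.

```python
from typing import Iterable, Sequence


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between a predicted answer set and a gold answer set."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0 if pred_set == gold_set else 0.0
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def hits_at_1(ranked_predictions: Sequence[str], gold: Iterable[str]) -> float:
    """1.0 if the top-ranked prediction is one of the gold answers, else 0.0."""
    if not ranked_predictions:
        return 0.0
    return 1.0 if ranked_predictions[0] in set(gold) else 0.0


def exact_match_accuracy(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """1.0 only if the predicted answer set equals the gold answer set."""
    return 1.0 if set(predicted) == set(gold) else 0.0
```

Dataset-level scores are then typically the mean of these per-question values.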

Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

Jan 26, 2024

Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song

Abstract: Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet open-source models with bilingual or Chinese language support remain notably scarce. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model developed by extending the capabilities of CLIP and Stable-Diffusion-XL through bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts using a large vision-language model, yielding better image captions and higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval. Furthermore, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in the field of image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are made publicly available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/, fostering further research and collaboration in this domain.

* Taiyi-Diffusion-XL Tech Report

Unified Lattice Graph Fusion for Chinese Named Entity Recognition

Dec 28, 2023

Dixiang Zhang, Junyu Lu, Pingjian Zhang

Abstract: Integrating a lexicon into character-level sequences has proven effective for leveraging word boundary and semantic information in Chinese named entity recognition (NER). However, prior approaches usually rely on feature weighting and position coupling to integrate word information and ignore the semantic and contextual correspondence between the fine-grained semantic units in the character-word space. To solve this issue, we propose a Unified Lattice Graph Fusion (ULGF) approach for Chinese NER. By converting the lattice structure into a unified graph, ULGF can explicitly capture various semantic and boundary relations across different semantic units with the adjacency matrix. We stack multiple graph-based intra-source self-attention and inter-source cross-gating fusion layers that iteratively carry out semantic interactions to learn node representations. To alleviate the over-reliance on word information, we further propose to leverage lexicon entity classification as an auxiliary task. Experiments on four Chinese NER benchmark datasets demonstrate the superiority of our ULGF approach.
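
To make the idea of converting a character-word lattice into one graph more concrete, here is a toy sketch. It assumes two edge types only (adjacent characters, and a matched word linked to the characters it covers); the actual relation types used by ULGF are not specified in the abstract, and the function name `build_lattice_graph` and the example lexicon are hypothetical.

```python
from typing import List, Set, Tuple


def build_lattice_graph(chars: List[str], lexicon: Set[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (nodes, adjacency matrix) for characters plus matched lexicon words."""
    n = len(chars)
    # Collect every multi-character span whose text appears in the lexicon.
    words = []  # entries are (start, end_inclusive, word)
    for start in range(n):
        for end in range(start + 1, n):
            candidate = "".join(chars[start:end + 1])
            if candidate in lexicon:
                words.append((start, end, candidate))

    # Character nodes come first, then one node per matched word.
    nodes = chars + [word for _, _, word in words]
    size = len(nodes)
    adj = [[0] * size for _ in range(size)]

    # Assumed edge type 1: neighbouring characters are connected.
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1

    # Assumed edge type 2: each word is connected to the characters it spans.
    for k, (start, end, _) in enumerate(words):
        word_idx = n + k
        for c in range(start, end + 1):
            adj[word_idx][c] = adj[c][word_idx] = 1

    return nodes, adj


chars = list("南京市长江大桥")
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
nodes, adj = build_lattice_graph(chars, lexicon)
```

In this sketch the adjacency matrix is symmetric and unweighted; a model in the spirit of the abstract would presumably distinguish relation types rather than use a single binary edge.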

iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design

Dec 19, 2023

Ruyi Gan, Xiaojun Wu, Junyu Lu, Yuanhe Tian, Dixiang Zhang, Ziwei Wu, Renliang Sun, Chang Liu, Jiaxing Zhang, Pingjian Zhang(+1 more)

Abstract: With the open-sourcing of text-to-image (T2I) models such as stable diffusion (SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned in specific domains based on the open-source SD model, such as anime and character portraits. However, there are few specialized models in certain domains, such as interior design, which is attributed to the complex textual descriptions and detailed visual elements inherent in design, alongside the necessity for adaptable resolution. Therefore, text-to-image models for interior design are required to have outstanding prompt-following capabilities, as well as iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training in both English and Chinese on the basis of the open-source CLIP model. We also propose a fine-tuning strategy with curriculum learning and reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach and thereby improve the quality of image generation. The experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which achieves impressive results and outperforms strong baselines.

Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

Dec 08, 2023

Junyu Lu, Ruyi Gan, Dixiang Zhang, Xiaojun Wu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song

Abstract: Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios. However, the absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors. In this paper, we propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration. Building on the foundation of BLIP-2, Lyrics infuses local visual features extracted from a visual refiner that includes image tagging, object detection and semantic segmentation modules into the Querying Transformer, while on the text side, the language inputs are equipped with the bounding boxes and tags derived from the visual refiner. We further introduce a two-stage training scheme, in which the pre-training stage bridges the modality gap through explicit and comprehensive vision-language alignment targets. During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects. Our approach achieves strong performance on 13 held-out datasets across various vision-language tasks, and demonstrates promising multi-modal understanding and detailed depiction capabilities in real dialogue scenarios.

Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

Oct 29, 2023

Junyu Lu, Dixiang Zhang, Xiaojun Wu, Xinyu Gao, Ruyi Gan, Jiaxing Zhang, Yan Song, Pingjian Zhang

Abstract: Recent advancements have expanded the capabilities of large language models (LLMs) in zero-shot image-to-text generation and understanding by integrating multi-modal inputs. However, such success is typically limited to English scenarios due to the lack of large-scale and high-quality non-English multi-modal resources, making it extremely difficult to establish competitive counterparts in other languages. In this paper, we introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs) designed to incorporate visual semantics into LLMs for multi-modal dialogue. Composed of Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying Transformer from BLIP-2 and further explore optimization schemes such as instruction tuning, multi-stage training and a low-rank adaptation module for visual-language alignment. In addition, we elicit the understanding ability of GPT-4 in multi-modal scenarios, translating our gathered English image-text datasets into Chinese and generating instruction-response pairs through in-context learning. The experimental results demonstrate that, compared to existing LVLMs, Ziya-Visual achieves competitive performance across a wide range of English-only tasks including zero-shot image-text retrieval, image captioning, and visual question answering. The evaluation leaderboard assessed by GPT-4 also indicates that our models possess satisfactory image-text understanding and generation capabilities in Chinese multi-modal dialogue scenarios. Code, demo and models are available at https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1.

UniEX: An Effective and Efficient Framework for Unified Information Extraction via a Span-extractive Perspective

May 22, 2023

Ping Yang, Junyu Lu, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Jiaxing Zhang, Pingjian Zhang

Abstract: We propose a new paradigm for universal information extraction (IE) that is compatible with any schema format and applicable to a list of IE tasks, such as named entity recognition, relation extraction, event extraction and sentiment analysis. Our approach converts text-based IE tasks into a token-pair problem, which uniformly disassembles all extraction targets into joint span detection, classification and association problems within a unified extractive framework, namely UniEX. UniEX can synchronously encode schema-based prompts and textual information, and collaboratively learn generalized knowledge from pre-defined information using auto-encoder language models. We develop a triaffine attention mechanism to integrate heterogeneous factors including tasks, labels and inside tokens, and obtain the extraction target via a scoring matrix. Experimental results show that UniEX can outperform generative universal IE models in terms of performance and inference speed on 14 benchmark IE datasets in the supervised setting. The state-of-the-art performance in low-resource scenarios also verifies the transferability and effectiveness of UniEX.
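
As a rough illustration of reading extraction targets off a token-pair scoring matrix, the sketch below treats entry `scores[i][j]` as the score that tokens `i` through `j` form a span and keeps entries above a threshold. It does not implement the attention mechanism described above; the function name `decode_spans`, the threshold, and the score values are made up for the example.

```python
from typing import List, Tuple


def decode_spans(
    tokens: List[str],
    scores: List[List[float]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, str]]:
    """Return (start, end_inclusive, text) for every token pair scoring above threshold."""
    spans = []
    for start in range(len(tokens)):
        # Only the upper triangle is valid: a span cannot end before it starts.
        for end in range(start, len(tokens)):
            if scores[start][end] > threshold:
                spans.append((start, end, " ".join(tokens[start:end + 1])))
    return spans


tokens = ["Barack", "Obama", "visited", "Paris"]
# Hypothetical scores for a single label; a real model would produce one matrix per label.
scores = [
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.8],
]
spans = decode_spans(tokens, scores)
```

Because every span is scored in one pass over the matrix, decoding of this kind needs no token-by-token generation, which is consistent with the inference-speed advantage over generative models claimed in the abstract.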

Flat Multi-modal Interaction Transformer for Named Entity Recognition

Aug 23, 2022

Junyu Lu, Dixiang Zhang, Pingjian Zhang

Abstract: Multi-modal named entity recognition (MNER) aims at identifying entity spans and recognizing their categories in social media posts with the aid of images. However, in dominant MNER approaches, the interaction of different modalities is usually carried out through the alternation of self-attention and cross-attention or over-reliance on the gating mechanism, which results in imprecise and biased correspondence between fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first utilize noun phrases in sentences and general domain words to obtain visual cues. Then, we transform the fine-grained semantic representations of vision and text into a unified lattice structure and design a novel relative position encoding to match different modalities in the Transformer. Meanwhile, we propose to leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our method achieves new state-of-the-art performance on two benchmark datasets.

* Accepted by COLING 2022, oral paper

Entity Candidate Network for Whole-Aware Named Entity Recognition

Apr 29, 2020

Wendong He, Yizhen Shao, Pingjian Zhang

Abstract: Named Entity Recognition (NER) is a crucial upstream task in Natural Language Processing (NLP). Traditional tag scheme approaches offer a single recognition that does not meet the needs of many downstream tasks such as coreference resolution. Moreover, tag scheme approaches ignore the continuity of entities. Inspired by one-stage object detection models in computer vision (CV), this paper proposes a new no-tag scheme, the Whole-Aware Detection, which makes NER an object detection task. This paper also presents a novel model, Entity Candidate Network (ECNet), and a specific convolution network, Adaptive Context Convolution Network (ACCN), to fuse multi-scale contexts and encode entity information at each position. ECNet identifies the full span of a named entity and its type at each position based on Entity Loss. Furthermore, ECNet can be tuned between the highest precision and the highest recall, while tag scheme approaches cannot. Experimental results on the CoNLL 2003 English dataset and the WNUT 2017 dataset show that ECNet outperforms previous state-of-the-art methods.

* 10 pages, 4 figures