Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bingbing Wen

MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

May 23, 2025

Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov(+2 more)

Abstract:Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.

Via

Access Paper or Ask Questions

Know Your Limits: A Survey of Abstention in Large Language Models

Aug 08, 2024

Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang

Figure 1 for Know Your Limits: A Survey of Abstention in Large Language Models

Figure 2 for Know Your Limits: A Survey of Abstention in Large Language Models

Figure 3 for Know Your Limits: A Survey of Abstention in Large Language Models

Figure 4 for Know Your Limits: A Survey of Abstention in Large Language Models

Abstract:Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics using this framework, and discuss merits and limitations of prior work. We further identify and motivate areas for future work, centered around whether abstention can be achieved as a meta-capability that transcends specific tasks or domains, while still providing opportunities to optimize abstention abilities based on context.

* preprint

Via

Access Paper or Ask Questions

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Jul 29, 2024

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

Abstract:To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data mixtures over different domains. In this work, we demonstrate that the optimal data composition for a fixed compute budget varies depending on the scale of the training data, suggesting that the common practice of empirically determining an optimal composition using small-scale experiments will not yield the optimal data mixtures when scaling up to the final model. To address this challenge, we propose *AutoScale*, an automated tool that finds a compute-optimal data composition for training at any desired target scale. AutoScale first determines the optimal composition at a small scale using a novel bilevel optimization framework, Direct Data Optimization (*DDO*), and then fits a predictor to estimate the optimal composition at larger scales. The predictor's design is inspired by our theoretical analysis of scaling laws related to data composition, which could be of independent interest. In empirical studies with pre-training 774M Decoder-only LMs (GPT-2 Large) on RedPajama dataset, AutoScale decreases validation perplexity at least 25% faster than any baseline with up to 38% speed up compared to without reweighting, achieving the best overall performance across downstream tasks. On pre-training Encoder-only LMs (BERT) with masked language modeling, DDO is shown to decrease loss on all domains while visibly improving average task performance on GLUE benchmark by 8.7% and on large-scale QA dataset (SQuAD) by 5.9% compared with without reweighting. AutoScale speeds up training by up to 28%. Our codes are open-sourced.

Via

Access Paper or Ask Questions

The Art of Refusal: A Survey of Abstention in Large Language Models

Jul 25, 2024

Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang

Figure 1 for The Art of Refusal: A Survey of Abstention in Large Language Models

Figure 2 for The Art of Refusal: A Survey of Abstention in Large Language Models

Figure 3 for The Art of Refusal: A Survey of Abstention in Large Language Models

Figure 4 for The Art of Refusal: A Survey of Abstention in Large Language Models

Abstract:Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in building LLM systems. In this survey, we introduce a framework to examine abstention behavior from three perspectives: the query, the model, and human values. We review the literature on abstention methods (categorized based on the development stages of LLMs), benchmarks, and evaluation metrics, and discuss the merits and limitations of prior work. We further identify and motivate areas for future research, such as encouraging the study of abstention as a meta-capability across tasks and customizing abstention abilities based on context. In doing so, we aim to broaden the scope and impact of abstention methodologies in AI systems.

* preprint

Via

Access Paper or Ask Questions

Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

May 27, 2024

Robert Wolfe, Isaac Slaughter, Bin Han, Bingbing Wen, Yiwei Yang, Lucas Rosenblatt, Bernease Herman, Eva Brown, Zening Qu, Nic Weber(+1 more)

Figure 1 for Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Figure 2 for Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Figure 3 for Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Figure 4 for Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Abstract:The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty in using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance in domain-adapted tasks without sacrificing generality. We then run experiments considering practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of reproducibility and transparency over for-profit state-of-the-art zero shot performance, finding this cost to be marginal under realistic settings.

* Accepted at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2024

Via

Access Paper or Ask Questions

Characterizing LLM Abstention Behavior in Science QA with Context Perturbations

Apr 18, 2024

Bingbing Wen, Bill Howe, Lucy Lu Wang

Abstract:The correct model response in the face of uncertainty is to abstain from answering a question so as not to mislead the user. In this work, we study the ability of LLMs to abstain from answering context-dependent science questions when provided insufficient or incorrect context. We probe model sensitivity in several settings: removing gold context, replacing gold context with irrelevant context, and providing additional context beyond what is given. In experiments on four QA datasets with four LLMs, we show that performance varies greatly across models, across the type of context provided, and also by question type; in particular, many LLMs seem unable to abstain from answering boolean questions using standard QA prompts. Our analysis also highlights the unexpected impact of abstention performance on QA task accuracy. Counter-intuitively, in some settings, replacing gold context with irrelevant context or adding irrelevant context to gold context can improve abstention performance in a way that results in improvements in task performance. Our results imply that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.

Via

Access Paper or Ask Questions

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

Dec 21, 2023

Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

Abstract:In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge the large-scale multimodal model (e.g., GIT) and the language models (e.g., GPT-3). GIT can describe the image content even with scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter the low-quality conversations.Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene texts, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Last, we propose a strong baseline by adapting the GIT model for the visual dialogue task and fine-tune the model on InfoVisDial. Hopefully, our work can motivate more effort on this direction.

Via

Access Paper or Ask Questions

OmniMotionGPT: Animal Motion Generation with Limited Data

Nov 30, 2023

Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan, Bingbing Wen, Ziwei Xuan, Mitch Hill, Junjie Bai, Guo-Jun Qi, Yalin Wang

Figure 1 for OmniMotionGPT: Animal Motion Generation with Limited Data

Figure 2 for OmniMotionGPT: Animal Motion Generation with Limited Data

Figure 3 for OmniMotionGPT: Animal Motion Generation with Limited Data

Figure 4 for OmniMotionGPT: Animal Motion Generation with Limited Data

Abstract:Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked, it remains challenging to transfer this success to other skeleton structures with limited data. In this work, we design a model architecture that imitates Generative Pretraining Transformer (GPT), utilizing prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and at the same time optimize through the similarity scores among human motion encoding, animal motion encoding, and text CLIP embedding. Presenting the first solution to this problem, we are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally, we introduce AnimalML3D, the first text-animal motion dataset with 1240 animation sequences spanning 36 different animal identities. We hope this dataset would mediate the data scarcity problem in text-driven animal motion generation, providing a new playground for the research community.

* The project page is at https://zshyang.github.io/omgpt-website/

Via

Access Paper or Ask Questions

EGCR: Explanation Generation for Conversational Recommendation

Aug 18, 2022

Bingbing Wen, Xiaoning Bu, Chirag Shah

Figure 1 for EGCR: Explanation Generation for Conversational Recommendation

Figure 2 for EGCR: Explanation Generation for Conversational Recommendation

Figure 3 for EGCR: Explanation Generation for Conversational Recommendation

Figure 4 for EGCR: Explanation Generation for Conversational Recommendation

Abstract:Growing attention has been paid in Conversational Recommendation System (CRS), which works as a conversation-based and recommendation task-oriented tool to provide items of interest and explore user preference. However, existing work in CRS fails to explicitly show the reasoning logic to users and the whole CRS still remains a black box. Therefore we propose a novel end-to-end framework named Explanation Generation for Conversational Recommendation (EGCR) based on generating explanations for conversational agents to explain why they make the action. EGCR incorporates user reviews to enhance the item representation and increase the informativeness of the whole conversation. To the best of our knowledge, this is the first framework for explainable conversational recommendation on real-world datasets. Moreover, we evaluate EGCR on one benchmark conversational recommendation datasets and achieve better performance on both recommendation accuracy and conversation quality than other state-of-the art models. Finally, extensive experiments demonstrate that generated explanations are not only having high quality and explainability, but also making CRS more trustworthy. We will make our code available to contribute to the CRS community

Via

Access Paper or Ask Questions

Towards Generating Robust, Fair, and Emotion-Aware Explanations for Recommender Systems

Aug 17, 2022

Bingbing Wen, Yunhe Feng, Yongfeng Zhang, Chirag Shah

Figure 1 for Towards Generating Robust, Fair, and Emotion-Aware Explanations for Recommender Systems

Figure 2 for Towards Generating Robust, Fair, and Emotion-Aware Explanations for Recommender Systems

Figure 3 for Towards Generating Robust, Fair, and Emotion-Aware Explanations for Recommender Systems

Figure 4 for Towards Generating Robust, Fair, and Emotion-Aware Explanations for Recommender Systems

Abstract:As recommender systems become increasingly sophisticated and complex, they often suffer from lack of fairness and transparency. Providing robust and unbiased explanations for recommendations has been drawing more and more attention as it can help address these issues and improve trustworthiness and informativeness of recommender systems. However, despite the fact that such explanations are generated for humans who respond more strongly to messages with appropriate emotions, there is a lack of consideration for emotions when generating explanations for recommendations. Current explanation generation models are found to exaggerate certain emotions without accurately capturing the underlying tone or the meaning. In this paper, we propose a novel method based on a multi-head transformer, called Emotion-aware Transformer for Explainable Recommendation (EmoTER), to generate more robust, fair, and emotion-enhanced explanations. To measure the linguistic quality and emotion fairness of the generated explanations, we adopt both automatic text metrics and human perceptions for evaluation. Experiments on three widely-used benchmark datasets with multiple evaluation metrics demonstrate that EmoTER consistently outperforms the existing state-of-the-art explanation generation models in terms of text quality, explainability, and consideration for fairness to emotion distribution. Implementation of EmoTER will be released as an open-source toolkit to support further research.

Via

Access Paper or Ask Questions