Abstract:Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks that test rudimentary capabilities, and thus fall short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge as well as deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multiple-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Thanks to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs, including the proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
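Since MMT-Bench scores models as multiple-choice accuracy broken down over subtasks, a minimal sketch of that protocol follows. The `model.answer` interface and the field names (`image`, `question`, `choices`, `gold`, `subtask`) are illustrative assumptions, not MMT-Bench's actual data schema.

```python
from collections import defaultdict

# Minimal sketch of per-subtask multiple-choice scoring.
# `model.answer` and the question fields are hypothetical.
def evaluate(model, questions):
    hits, totals = defaultdict(int), defaultdict(int)
    for q in questions:
        pred = model.answer(q["image"], q["question"], q["choices"])  # e.g. "B"
        totals[q["subtask"]] += 1
        hits[q["subtask"]] += int(pred == q["gold"])
    # Per-subtask accuracy; an overall score could average these values.
    return {t: hits[t] / totals[t] for t in totals}
```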
Abstract:With the emergence of Large Language Models (LLMs), the programming capabilities of such models have improved markedly, attracting growing attention from researchers. We propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension and code generation abilities of LLMs. CodeApex comprises three types of multiple-choice questions: conceptual understanding, commonsense reasoning, and multi-hop reasoning, designed to evaluate LLMs on programming comprehension tasks. In addition, CodeApex uses algorithmic questions and corresponding test cases to assess the quality of code generated by LLMs. We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models. GPT exhibits the best programming capabilities, achieving approximate accuracies of 50% and 56% on the two tasks, respectively, which leaves substantial room for improvement on programming tasks. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs and further promote their development. The datasets are released at https://github.com/APEXLAB/CodeApex.git, and the CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.
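Because CodeApex grades generated programs against test cases, the sketch below shows one way such scoring could work: run the candidate program on each test input and compare its stdout to the expected output. The subprocess runner, stdin/stdout convention, and timeout are assumptions, not the benchmark's actual harness.

```python
import os
import subprocess
import tempfile

def passes(code: str, stdin_text: str, expected: str, timeout: int = 5) -> bool:
    """Run one candidate program on one test case and compare stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], input=stdin_text,
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout.strip() == expected.strip()
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def score(code: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of test cases the generated program passes."""
    return sum(passes(code, i, o) for i, o in tests) / len(tests)

# e.g. score(generated_code, [("3 4\n", "7\n"), ("10 -2\n", "8\n")])
```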
Abstract:This paper proposes a novel simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) assisted unmanned aerial vehicle (UAV) non-orthogonal multiple access (NOMA) emergency communication network. Multiple STAR-RISs are deployed to provide additional, intelligent transmission links between trapped users and a UAV-mounted base station (BS). Each user selects the nearest STAR-RIS for uploading data, and NOMA is employed for users located on the same side of the same STAR-RIS. Considering the practical requirements of post-disaster emergency communications, we formulate a throughput maximization problem subject to constraints on the minimum average rate and the maximum energy consumption, in which the UAV trajectory, the STAR-RIS passive beamforming, and the time and power allocation are jointly optimized. Furthermore, we propose a Lagrange-based reward constrained proximal policy optimization (LRCPPO) algorithm, which provides an adaptive method for solving this long-term optimization problem with cumulative constraints. Specifically, using Lagrange relaxation, the original problem is transformed into an unconstrained problem with a two-layer structure. The inner layer is solved by a penalized-reward proximal policy optimization (PPO) algorithm, while in the outer layer the Lagrange multipliers are updated by gradient descent. Numerical results show that the proposed algorithm effectively improves network performance while satisfying the constraints. The results also demonstrate the superiority of the proposed STAR-RIS assisted UAV NOMA network architecture over benchmark schemes employing reflecting-only RISs and orthogonal multiple access.
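The two-layer structure the abstract describes can be made concrete with a toy sketch: an inner loop maximizes the Lagrange-penalized reward (where LRCPPO would run PPO on the UAV/STAR-RIS environment) and an outer loop takes a projected gradient step on the multipliers. Every objective and constraint below is an illustrative 1-D stand-in, not the paper's actual system model.

```python
# Toy two-layer Lagrangian loop. throughput() is the objective to maximize;
# rate_gap() <= 0 and energy_gap() <= 0 encode the two constraints.
# Plain gradient ascent stands in for the inner PPO update.

def throughput(v):
    return -(v - 3.0) ** 2 + 9.0          # maximized at v = 3

def rate_gap(v):
    return 2.0 - v                        # minimum-average-rate constraint

def energy_gap(v):
    return v - 4.0                        # energy-consumption budget

def grad(f, v, eps=1e-5):                 # numerical gradient keeps it short
    return (f(v + eps) - f(v - eps)) / (2 * eps)

x, lam_r, lam_e = 0.0, 0.0, 0.0
for _ in range(200):                       # outer layer
    def penalized(v):                      # Lagrange-relaxed objective
        return throughput(v) - lam_r * rate_gap(v) - lam_e * energy_gap(v)
    for _ in range(50):                    # inner layer (PPO stand-in)
        x += 0.05 * grad(penalized, x)
    # Outer layer: multiplier gradient step, projected onto lambda >= 0.
    lam_r = max(0.0, lam_r + 0.1 * rate_gap(x))
    lam_e = max(0.0, lam_e + 0.1 * energy_gap(x))

print(f"x = {x:.3f}, throughput = {throughput(x):.3f}")  # settles near x = 3
```

The outer update grows a multiplier only while its constraint is violated, which is how the penalized inner objective is steered back toward feasibility.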
Abstract:New NLP benchmarks are urgently needed to keep pace with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess the advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from the humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that require advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. The results indicate that only GPT-4 achieves an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate that C-Eval will help analyze important strengths and shortcomings of foundation models and foster their development and growth for Chinese users.
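When a suite spans 52 disciplines of very different sizes, the choice of aggregation matters; the abstract does not say whether C-Eval micro- or macro-averages, so the sketch below shows the macro variant (mean of per-discipline accuracies, so small subjects count equally). The record layout is an assumption.

```python
def macro_average(records):
    """records: iterable of (discipline, correct) pairs; returns the
    mean of per-discipline accuracies plus the per-discipline scores."""
    per = {}
    for disc, ok in records:
        n_ok, n = per.get(disc, (0, 0))
        per[disc] = (n_ok + int(ok), n + 1)
    accs = {d: n_ok / n for d, (n_ok, n) in per.items()}
    return sum(accs.values()) / len(accs), accs

# e.g. macro_average([("physics", True), ("law", False), ("law", True)])
```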