Abstract:While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks or evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities. We evaluated the performance of both open-source models and API providers, finding that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset, translated the VQA benchmark into Ukrainian, and measured performance degradation relative to original English versions. Lastly, we tested a few models from a cultural perspective on knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and our approach could be useful for other low-resource languages.
Abstract:This study investigates gender bias in large language models (LLMs) by comparing their gender perception to that of human respondents, U.S. Bureau of Labor Statistics data, and a 50% no-bias benchmark. We created a new evaluation set using occupational data and role-specific sentences. Unlike common benchmarks included in LLM training data, our set is newly developed, preventing data leakage and test set contamination. Five LLMs were tested to predict the gender for each role using single-word answers. We used Kullback-Leibler (KL) divergence to compare model outputs with human perceptions, statistical data, and the 50% neutrality benchmark. All LLMs showed significant deviation from gender neutrality and aligned more with statistical data, still reflecting inherent biases.