Daniel Z. Kaplan

Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

Jan 16, 2025

Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish

Abstract: The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin, a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use it to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP, a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

Aug 14, 2024

Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem

Abstract: Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.

* Accepted at INTERSPEECH 2024
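
The SER Evals abstract mentions logit adjustment to account for varying class distributions. As a minimal sketch of one common form of that idea (post-hoc adjustment, where scaled log class priors are subtracted from the logits), and not necessarily the procedure used in the paper, the following may help. The function names, the `tau` scaling parameter, the add-one smoothing, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter


def class_log_priors(labels, num_classes):
    """Estimate log class priors from training labels (hypothetical helper).

    Add-one smoothing is an assumption made here so that a class with no
    training examples does not produce log(0).
    """
    counts = Counter(labels)
    total = len(labels) + num_classes
    return [math.log((counts.get(c, 0) + 1) / total) for c in range(num_classes)]


def adjust_logits(logits, log_priors, tau=1.0):
    """Subtract tau * log(prior) from each logit.

    Frequent classes have larger log priors, so their logits are reduced
    relative to rare classes. tau = 0 leaves the logits unchanged.
    """
    return [z - tau * lp for z, lp in zip(logits, log_priors)]


def predict(logits, log_priors, tau=1.0):
    """Return the index of the largest adjusted logit."""
    adjusted = adjust_logits(logits, log_priors, tau)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy, made-up data: class 0 is nine times as frequent as class 1.
train_labels = [0] * 90 + [1] * 10
log_priors = class_log_priors(train_labels, num_classes=2)

# A raw prediction that slightly favors the majority class.
raw_logits = [1.0, 0.5]

unadjusted_prediction = predict(raw_logits, log_priors, tau=0.0)
adjusted_prediction = predict(raw_logits, log_priors, tau=1.0)
```

In this toy case the unadjusted prediction is the majority class, while the adjusted prediction flips to the minority class because the prior correction outweighs the small gap between the raw logits. In practice `tau` would be tuned on held-out data.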

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Jan 21, 2024

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole

Abstract: We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g., $1024 \times 1024$) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders, or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.

* 20 pages, 13 figures, project page and code available at https://crowsonkb.github.io/hourglass-diffusion-transformers/
