Abstract: We propose to use machine-generated instruction-following data to improve the zero-shot capabilities of a large multimodal model, with additional support for generative and image-editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set together with the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pre-trained models via instruction finetuning: LLaMA for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities on par with LLaVA and achieves competitive results against native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively reusing existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
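A minimal, illustrative sketch of how the three pre-trained components named in this abstract could be wired together for understanding and generation. The class names (VisionProjector, GenLLaVAStyleModel), the two-layer projector, and the dimensions are assumptions for illustration only, not the authors' released architecture.

```python
# Hypothetical wiring sketch: SigLIP-style encoder -> projector -> LLaMA-style LLM,
# with a Stable Diffusion-style pipeline consuming the LLM's decoded prompt.
# Names, dimensions, and the projector design are illustrative assumptions.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM token-embedding space."""
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):       # (B, N, vision_dim)
        return self.proj(patch_features)     # (B, N, llm_dim)

class GenLLaVAStyleModel(nn.Module):
    """Understanding path: projected image tokens are prepended to text embeddings
    and fed to a causal LLM. Generation/editing path: the LLM's decoded prompt
    conditions a text-to-image diffusion model."""
    def __init__(self, vision_encoder, llm, diffusion, projector=None):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP vision tower
        self.llm = llm                        # e.g. a LLaMA causal LM that accepts inputs_embeds
        self.diffusion = diffusion            # e.g. a Stable Diffusion text-to-image pipeline
        self.projector = projector or VisionProjector()

    def understand(self, image, text_embeds):
        image_tokens = self.projector(self.vision_encoder(image))
        return self.llm(inputs_embeds=torch.cat([image_tokens, text_embeds], dim=1))

    def generate_image(self, decoded_prompt):
        # The LLM's decoded text (or generation instruction) drives image synthesis.
        return self.diffusion(decoded_prompt)
```

The point of the sketch is the modularity the abstract emphasizes: each component is a frozen or finetuned pre-trained model, and only lightweight glue (here, the projector) ties them together.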
Abstract: Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and reconstructing them under a reconstruction loss. Alternatively, contrastive self-supervised methods encourage two versions of the same input to have similar representations while pulling apart the representations of different inputs. We propose ViC-MAE, a general method that combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames. We show that visual representations learned under ViC-MAE generalize well to both video classification and image classification tasks. Using a ViT-B/16 backbone pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1k, improving absolute top-1 accuracy by 1.58% over recent prior work. Moreover, our method maintains competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives such as VICReg and SimSiam.
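A compact sketch of the combination this abstract describes: an MAE-style reconstruction term per frame plus a contrastive (InfoNCE) term on pooled global frame features. The helper names, the mean pooling, the assumption that the model returns its own reconstruction loss, and the lambda_c weight are all illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative ViC-MAE-style training step: MAE reconstruction on two frames of
# the same video, plus a contrastive loss on their pooled global representations.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE between two batches of pooled frame embeddings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)

def vic_mae_step(model, frame_a, frame_b, lambda_c=1.0):
    # Assumed interface: model(frame) -> (reconstruction_loss, patch_features),
    # where patch_features has shape (B, N, D) for the visible patches.
    rec_a, feats_a = model(frame_a)
    rec_b, feats_b = model(frame_b)
    loss_rec = rec_a + rec_b
    # Pool local (patch) features into one global representation per frame ...
    g_a, g_b = feats_a.mean(dim=1), feats_b.mean(dim=1)
    # ... and pull the two frames of the same video together, other videos apart.
    loss_con = info_nce(g_a, g_b)
    return loss_rec + lambda_c * loss_con
```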
Abstract: Multiple-Object Tracking (MOT) is of crucial importance for applications such as retail video analytics and video surveillance. Object detectors are often the computational bottleneck of modern MOT systems, limiting their use in real-time applications. In this paper, we address this issue by leveraging an ensemble of detectors, each running every f frames. We measured the performance of our system on the MOT16 benchmark. The proposed model surpassed other online entries of the MOT16 challenge in speed while maintaining acceptable accuracy.
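One plausible reading of "each detector running every f frames" is a round-robin schedule in which exactly one ensemble member fires per frame, so each detector carries only 1/f of the detection cost. The sketch below encodes that assumption; run_detector and update_tracks are hypothetical placeholders for the detection and track-association steps, not functions from the paper's system.

```python
# Assumed round-robin scheduling of a detector ensemble for online MOT:
# with K detectors and f = K, each detector runs once every f frames while
# a lightweight association step keeps tracks alive between its turns.
def track_video(frames, detectors, run_detector, update_tracks):
    f = len(detectors)                      # each detector fires every f frames
    tracks = []
    for t, frame in enumerate(frames):
        detector = detectors[t % f]         # round-robin over the ensemble
        detections = run_detector(detector, frame)
        tracks = update_tracks(tracks, detections, frame)
        yield tracks                        # online output, one result per frame
```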
Abstract: A restricted Boltzmann machine (RBM) is a generative neural-network model with many novel applications, such as collaborative filtering and acoustic modeling. An RBM lacks the capacity to retain memory, making it unsuitable for dynamic data modeling as in time-series analysis. In this paper we address this issue by proposing the p-RBM model, a generalization of the regular RBM capable of retaining memory of the p past states. We further show how to train the p-RBM using contrastive divergence and test our model on the problem of predicting stock market direction, considering 100 stocks of the NASDAQ-100 index. The obtained results show that the p-RBM offers promising prediction potential.
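A rough NumPy sketch of an RBM whose biases are conditioned on the p previous visible states and trained with one step of contrastive divergence (CD-1). The parameterization below, with history-to-visible and history-to-hidden weight matrices A and B, is an assumption in the spirit of conditional RBMs, not necessarily the authors' exact p-RBM formulation.

```python
# Assumed conditional-RBM-style sketch of "memory of p past states":
# the p previous visible vectors shift the visible and hidden biases.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PastConditionedRBM:
    def __init__(self, n_visible, n_hidden, p):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.A = 0.01 * rng.standard_normal((p * n_visible, n_visible))  # past -> visible bias
        self.B = 0.01 * rng.standard_normal((p * n_visible, n_hidden))   # past -> hidden bias
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    def cd1_update(self, v, history, lr=1e-3):
        """One CD-1 step given the current state v (n_visible,) and the
        concatenated p past states history (p * n_visible,)."""
        b_v = self.b_v + history @ self.A          # dynamic biases from the past
        b_h = self.b_h + history @ self.B
        ph0 = sigmoid(v @ self.W + b_h)            # positive-phase hidden probabilities
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(h0 @ self.W.T + b_v)          # mean-field reconstruction
        ph1 = sigmoid(v1 @ self.W + b_h)           # negative-phase hidden probabilities
        self.W += lr * (np.outer(v, ph0) - np.outer(v1, ph1))
        self.A += lr * np.outer(history, v - v1)
        self.B += lr * np.outer(history, ph0 - ph1)
        self.b_v += lr * (v - v1)
        self.b_h += lr * (ph0 - ph1)
```

For the stock-direction task described in the abstract, the visible units would encode per-stock movements at time t and the history would stack the p preceding time steps, but that encoding is likewise an assumption here.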