Neelabh Sinha

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Sep 14, 2024

Neelabh Sinha, Vinija Jain, Aman Chadha

Abstract: Visual Question-Answering (VQA) has become a key use case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. However, evaluating different VLMs against an application's requirements using a standardized framework in practical settings remains challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.

* 8 pages + references + 6 pages of Appendix

Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

Jun 17, 2024

Neelabh Sinha, Vinija Jain, Aman Chadha

Abstract: The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains, and reasoning types, using diverse prompt styles. We demonstrate that the most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use case and other constraints. We also show that, if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.

Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding

Dec 22, 2021

Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha, Francois Bremond

Abstract: Personality computing and affective computing have gained recent interest in many research areas. The datasets for the task generally have multiple modalities like video, audio, language and bio-signals. In this paper, we propose a flexible model for the task which exploits all available data. The task involves complex relations, and to avoid using a large model specifically for video processing, we propose the use of behaviour encoding, which boosts performance with minimal change to the model. Cross-attention using transformers has become popular in recent times and is utilised for fusion of different modalities. Since long-term relations may exist, breaking the input into chunks is not desirable; the proposed model therefore processes the entire input together. Our experiments show the importance of each of the above contributions.

* Preprint. Final paper accepted at the 17th International Conference on Computer Vision Theory and Applications, VISAPP 2022, Virtual, February 6-8, 2022. 8 pages

FLAME: Facial Landmark Heatmap Activated Multimodal Gaze Estimation

Oct 10, 2021

Neelabh Sinha, Michal Balazia, François Bremond

Abstract: 3D gaze estimation is about predicting the line of sight of a person in 3D space. Person-independent models for this task lack precision due to anatomical differences between subjects, whereas person-specific calibrated techniques impose strict constraints on scalability. To overcome these issues, we propose a novel technique, Facial Landmark Heatmap Activated Multimodal Gaze Estimation (FLAME), as a way of combining eye anatomical information using eye landmark heatmaps to obtain precise gaze estimation without any person-specific calibration. Our evaluation demonstrates competitive performance, with an improvement of about 10% on the benchmark datasets ColumbiaGaze and EYEDIAP. We also conduct an ablation study to validate our method.

* Preprint. Final paper accepted at the 17th IEEE International Conference on Advanced Video and Signal-based Surveillance, AVSS 2021, Virtual, November 16-19, 2021. 8 pages
