Abstract:Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs' visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps - writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.
Abstract:Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensive analysis of two recent T2I models. Exploring compound prompts, we find that complex scenes describing multiple objects are composed progressively and more slowly compared to simple scenes; Exploring knowledge retrieval, we find that representation of uncommon concepts requires further computation compared to common concepts, and that knowledge retrieval is gradual across layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines.
Abstract:Text-To-Image (TTI) models, exemplified by DALL-E and StableDiffusion, have recently gained prominence for their remarkable zero-shot capabilities in generating images guided by textual prompts. Language, as a conduit of culture, plays a pivotal role in these models' multilingual capabilities, which in turn shape their cultural agency. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. We propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual-Question-Answer (VQA) model, and human assessments, to discern TTI cultural perceptions. To facilitate our research, we introduce the CulText2I dataset, derived from four diverse TTI models and spanning ten languages. Our experiments reveal insights into these models' cultural awareness, cultural distinctions, and the unlocking of cultural features, releasing the potential for cross-cultural applications.
Abstract:Even for us, it can be challenging to comprehend the meaning of songs. As part of this project, we explore the process of generating the meaning of songs. Despite the widespread use of text-to-text models, few attempts have been made to achieve a similar objective. Songs are primarily studied in the context of sentiment analysis. This involves identifying opinions and emotions in texts, evaluating them as positive or negative, and utilizing these evaluations to make music recommendations. In this paper, we present a generative model that offers implicit meanings for several lines of a song. Our model uses a decoder Transformer architecture GPT-2, where the input is the lyrics of a song. Furthermore, we compared the performance of this architecture with that of the encoder-decoder Transformer architecture of the T5 model. We also examined the effect of different prompt types with the option of appending additional information, such as the name of the artist and the title of the song. Moreover, we tested different decoding methods with different training parameters and evaluated our results using ROUGE. In order to build our dataset, we utilized the 'Genious' API, which allowed us to acquire the lyrics of songs and their explanations, as well as their rich metadata.
Abstract:The staleness problem is a well-known problem when working with dynamic data, due to the absence of events for a long time. Since the memory of the node is updated only when the node is involved in an event, its memory becomes stale. Usually, it refers to a lack of events such as a temporal deactivation of a social account. To overcome the memory staleness problem aggregate information from the nodes neighbors memory in addition to the nodes memory. Inspired by that, we design an updated embedding module that inserts the most similar node in addition to the nodes neighbors. Our method achieved similar results to the TGN, with a slight improvement. This could indicate a potential improvement after fine-tuning our hyper-parameters, especially the time threshold, and using a learnable similarity metric.