Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Morris Alper

ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline

Aug 08, 2025

Morris Alper, Moran Yanuka, Raja Giryes, Gašper Beguš

Abstract:Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, large-scale foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages -- phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs' meta-linguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring coherence and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.

* Project page: https://conlangcrafter.github.io

Via

Access Paper or Ask Questions

WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild

Jun 16, 2025

Morris Alper, David Novotny, Filippos Kokkinos, Hadar Averbuch-Elor, Tom Monnier

Abstract:Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly less data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.

* Project page: https://wildcat3d.github.io

Via

Access Paper or Ask Questions

Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech

Jun 14, 2025

Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

Abstract:Real-time text-to-speech (TTS) for Modern Hebrew is challenging due to the language's orthographic complexity. Existing solutions ignore crucial phonetic features such as stress that remain underspecified even when vowel marks are added. To address these limitations, we introduce Phonikud, a lightweight, open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified IPA transcriptions. Our approach adapts an existing diacritization model with lightweight adaptors, incurring negligible additional latency. We also contribute the ILSpeech dataset of transcribed Hebrew speech with IPA annotations, serving as a benchmark for Hebrew G2P and as training data for TTS systems. Our results demonstrate that Phonikud G2P conversion more accurately predicts phonemes from Hebrew text compared to prior methods, and that this enables training of effective real-time Hebrew TTS models with superior speed-accuracy trade-offs. We release our code, data, and models at https://phonikud.github.io.

* Project page: https://phonikud.github.io

Via

Access Paper or Ask Questions

Dynamic Scene Understanding from Vision-Language Representations

Jan 20, 2025

Shahaf Pruss, Morris Alper, Hadar Averbuch-Elor

Figure 1 for Dynamic Scene Understanding from Vision-Language Representations

Figure 2 for Dynamic Scene Understanding from Vision-Language Representations

Figure 3 for Dynamic Scene Understanding from Vision-Language Representations

Figure 4 for Dynamic Scene Understanding from Vision-Language Representations

Abstract:Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.

Via

Access Paper or Ask Questions

WAFFLE: Multimodal Floorplan Understanding in the Wild

Dec 01, 2024

Keren Ganon, Morris Alper, Rachel Mikulinsky, Hadar Averbuch-Elor

Figure 1 for WAFFLE: Multimodal Floorplan Understanding in the Wild

Figure 2 for WAFFLE: Multimodal Floorplan Understanding in the Wild

Figure 3 for WAFFLE: Multimodal Floorplan Understanding in the Wild

Figure 4 for WAFFLE: Multimodal Floorplan Understanding in the Wild

Abstract:Buildings are a central feature of human culture and are increasingly being analyzed with computational methods. However, recent works on computational building understanding have largely focused on natural imagery of buildings, neglecting the fundamental element defining a building's structure -- its floorplan. Conversely, existing works on floorplan understanding are extremely limited in scope, often focusing on floorplans of a single semantic category and region (e.g. floorplans of apartments from a single country). In this work, we introduce WAFFLE, a novel multimodal floorplan understanding dataset of nearly 20K floorplan images and metadata curated from Internet data spanning diverse building types, locations, and data formats. By using a large language model and multimodal foundation models, we curate and extract semantic information from these images and their accompanying noisy metadata. We show that WAFFLE enables progress on new building understanding tasks, both discriminative and generative, which were not feasible using prior datasets. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.

Via

Access Paper or Ask Questions

Emergent Visual-Semantic Hierarchies in Image-Text Representations

Jul 11, 2024

Morris Alper, Hadar Averbuch-Elor

Figure 1 for Emergent Visual-Semantic Hierarchies in Image-Text Representations

Figure 2 for Emergent Visual-Semantic Hierarchies in Image-Text Representations

Figure 3 for Emergent Visual-Semantic Hierarchies in Image-Text Representations

Figure 4 for Emergent Visual-Semantic Hierarchies in Image-Text Representations

Abstract:While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image--text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase, while retaining pretraining knowledge.

* Accepted to ECCV 2024. Project page: https://hierarcaps.github.io/

Via

Access Paper or Ask Questions

ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Mar 02, 2024

Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes

Abstract:Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings.

Via

Access Paper or Ask Questions

MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

Dec 06, 2023

Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor

Figure 1 for MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

Figure 2 for MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

Figure 3 for MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

Figure 4 for MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

Abstract:While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses advancements in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; therefore, we propose a multi-objective reward function that jointly targets these qualities, without requiring any strong supervision. We demonstrate that these goals can be simultaneously optimized with our framework, enhancing performance for various captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance across various established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.

* Website Link: https://assafbk.github.io/mocha/

Via

Access Paper or Ask Questions

Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

Oct 25, 2023

Morris Alper, Hadar Averbuch-Elor

Abstract:Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regards to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.

* Accepted to NeurIPS 2023 (spotlight). Project webpage: https://kiki-bouba.github.io/

Via

Access Paper or Ask Questions

Learning Human-Human Interactions in Images from Weak Textual Supervision

Apr 27, 2023

Morris Alper, Hadar Averbuch-Elor

Abstract:Interactions between humans are diverse and context-dependent, but previous works have treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics that measure textual and semantic faithfulness and factual groundedness of our predictions. We further show that our approach outperforms SOTA image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually-curated test set for still image human-human interaction understanding.

* Project webpage: https://learning-interactions.github.io

Via

Access Paper or Ask Questions