Abstract: Inspired by the emergence of Large Language Models (LLMs) that can truly understand human language, significant progress has been made in aligning other, non-language modalities to be `understandable' by an LLM, primarily by converting their samples into sequences of embedded, language-like tokens fed directly into the LLM (decoder) input stream. However, limited attention has so far been given to transferring (and evaluating) one of the core LLM capabilities in the emerging VLMs, namely the In-Context Learning (ICL) ability: guiding a VLM toward a desired downstream task or output structure using in-context image+text demonstrations. In this work, we analyze the ability of several state-of-the-art VLMs to follow ICL instructions and find it to be somewhat lacking. We discover that even models that underwent large-scale mixed-modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot (ICL) demonstrations, likely due to their lack of `direct' ICL instruction tuning. To test this conjecture, we propose a simple yet surprisingly effective strategy: extending a common VLM alignment framework with ICL support, methodology, and curriculum. We explore, analyze, and provide insights into effective data mixes, leading to an ICL performance boost of up to 21.03% (11.3% on average) over the strongest VLM baselines across a variety of ICL benchmarks. We also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.
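To make the notion of in-context image+text demonstrations concrete, the following is a minimal sketch of how such an interleaved few-shot prompt could be assembled for a VLM. The `<image>` placeholder token, the instruction/answer template, and the file names are assumptions for illustration, not the paper's exact prompt format.

```python
# Hypothetical sketch: assembling an interleaved image+text few-shot (ICL)
# prompt for a VLM. The "<image>" placeholder and the template are assumptions.

from typing import List, Tuple

def build_icl_prompt(
    demonstrations: List[Tuple[str, str, str]],  # (image_path, instruction, answer)
    query_image: str,
    query_instruction: str,
    image_token: str = "<image>",
) -> Tuple[str, List[str]]:
    """Return the interleaved text prompt and the ordered list of image paths."""
    segments, image_paths = [], []
    for img_path, instruction, answer in demonstrations:
        # Each demonstration contributes an image placeholder, its instruction,
        # and the expected answer, so the model can infer the task/output format.
        segments.append(f"{image_token}\n{instruction}\n{answer}")
        image_paths.append(img_path)
    # The query repeats the same structure but leaves the answer to be generated.
    segments.append(f"{image_token}\n{query_instruction}\n")
    image_paths.append(query_image)
    return "\n\n".join(segments), image_paths

# Example: two demonstrations followed by one query.
prompt, images = build_icl_prompt(
    demonstrations=[
        ("demo1.jpg", "What is the main object?", "A red bicycle."),
        ("demo2.jpg", "What is the main object?", "A wooden chair."),
    ],
    query_image="query.jpg",
    query_instruction="What is the main object?",
)
print(prompt)
```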
Abstract: Standard breast cancer screening involves the acquisition of two mammography X-ray projections for each breast. Typically, a comparison of both views supports the challenging task of tumor detection and localization. We introduce a deep-learning, patch-based Siamese network for lesion matching in dual-view mammography. Our locally fitted approach generates a joint patch-pair representation and comparison with a configuration shared between the two views. We performed a comprehensive set of experiments with the network on standard datasets, among them the large Digital Database for Screening Mammography (DDSM). We analyzed the effect of transfer learning with the network between different types of datasets and compared the network-based matching to template matching with Euclidean distance. Finally, we evaluated the contribution of the matching network in a full detection pipeline. Experimental results demonstrate the promise of improved detection accuracy using our approach.
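The following is a minimal PyTorch sketch of the kind of patch-based Siamese matcher with a shared configuration between the two views described above. The patch size, layer widths, and matching head are assumed for illustration and are not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of a patch-based Siamese matcher for dual-view
# mammography, under assumed patch size and layer widths.

import torch
import torch.nn as nn

class SiamesePatchMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared convolutional encoder applied to patches from both views
        # (identical weights, i.e. the Siamese branches).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Joint head: the two patch embeddings are concatenated and scored
        # as matching / non-matching.
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, patch_cc: torch.Tensor, patch_mlo: torch.Tensor) -> torch.Tensor:
        emb_cc = self.encoder(patch_cc)    # same weights for both views
        emb_mlo = self.encoder(patch_mlo)
        joint = torch.cat([emb_cc, emb_mlo], dim=1)
        return self.head(joint)            # matching logit per patch pair

# Example: score a batch of 8 grayscale 64x64 patch pairs from the two views.
model = SiamesePatchMatcher()
cc = torch.randn(8, 1, 64, 64)
mlo = torch.randn(8, 1, 64, 64)
logits = model(cc, mlo)  # shape (8, 1); apply sigmoid + BCE loss for training
```

In contrast, the Euclidean-distance template-matching baseline mentioned above would score a patch pair directly by the distance between raw patch intensities, without any learned joint representation.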