Abstract:Vision-and-Language models such as CLIP have demonstrated remarkable effectiveness across a wide range of tasks. However, these models are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concern in their adoption. To overcome these limitations, we introduce a methodology to make Vision-and-Language models safer by removing their sensitivity to not-safe-for-work concepts. We show how this can be done by distilling from a large language model which converts between safe and unsafe sentences and which is fine-tuned starting from just 100 manually-curated pairs. We conduct extensive experiments on the resulting embedding space for both retrieval and text-to-image generation, where we show that our model can also be properly employed with pre-trained image generators. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
Abstract:In recent years, there has been a growing trend of using data-driven methods in industrial settings. These kinds of methods often process video images or parts, therefore the integrity of such images is crucial. Sometimes datasets, e.g. consisting of images, can be sophisticated for various reasons. It becomes critical to understand how the manipulation of video and images can impact the effectiveness of a machine learning method. Our case study aims precisely to analyze the Linemod dataset, considered the state of the art in 6D pose estimation context. That dataset presents images accompanied by ArUco markers; it is evident that such markers will not be available in real-world contexts. We analyze how the presence of the markers affects the pose estimation accuracy, and how this bias may be mitigated through data augmentation and other methods. Our work aims to show how the presence of these markers goes to modify, in the testing phase, the effectiveness of the deep learning method used. In particular, we will demonstrate, through the tool of saliency maps, how the focus of the neural network is captured in part by these ArUco markers. Finally, a new dataset, obtained by applying geometric tools to Linemod, will be proposed in order to demonstrate our hypothesis and uncovering the bias. Our results demonstrate the potential for bias in 6DOF pose estimation networks, and suggest methods for reducing this bias when training with markers.