Niccolò Biondi

ComiCap: A VLMs pipeline for dense captioning of Comic Panels

Sep 24, 2024

Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

Abstract: The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test set to fairly evaluate open-source VLMs and select the best captioning model according to our metric. Our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models, without requiring any additional training. Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be available on the project page https://github.com/emanuelevivoli/ComiCap.

* Accepted at ECCV 2024 Workshop (AI for Visual Art), repo: https://github.com/emanuelevivoli/ComiCap

Backward-Compatible Aligned Representations via an Orthogonal Transformation Layer

Aug 16, 2024

Simone Ricci, Niccolò Biondi, Federico Pernici, Alberto Del Bimbo

Abstract: Visual retrieval systems face significant challenges when updating models with improved representations due to misalignment between the old and new representations. The costly and resource-intensive backfilling process involves recalculating feature vectors for images in the gallery set whenever a new model is introduced. To address this, prior research has explored backward-compatible training methods that enable direct comparisons between new and old representations without backfilling. Despite these advancements, achieving a balance between backward compatibility and the performance of independently trained models remains an open problem. In this paper, we address it by expanding the representation space with additional dimensions and learning an orthogonal transformation to achieve compatibility with old models and, at the same time, integrate new information. This transformation preserves the original feature space's geometry, ensuring that our model aligns with previous versions while also learning new data. Our Orthogonal Compatible Aligned (OCA) approach eliminates the need for re-indexing during model updates and ensures that features can be compared directly across different model updates without additional mapping functions. Experimental results on CIFAR-100 and ImageNet-1k demonstrate that our method not only maintains compatibility with previous models but also achieves state-of-the-art accuracy, outperforming several existing methods.

* Accepted at BEW2024 Workshop at ECCV2024
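
The abstract above states that an orthogonal transformation preserves the geometry of the original feature space. The sketch below illustrates only that property and is not the OCA method itself: in the paper the transformation is learned, whereas here a fixed Householder reflection stands in for it, and the feature vectors, the two extra dimensions, and all function names are hypothetical.

```python
import math

def householder(v):
    """Return the orthogonal matrix I - 2 v v^T / (v^T v) as nested lists."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]

def matvec(m, x):
    """Multiply matrix m by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def pad(x, extra):
    """Expand a feature vector with `extra` zero-valued dimensions."""
    return list(x) + [0.0] * extra

# Two made-up 3-D "old" features, expanded to 5-D and then transformed.
old_a = [1.0, 2.0, 3.0]
old_b = [-1.0, 0.5, 2.0]

# Stand-in orthogonal map; a learned one would replace this in practice.
Q = householder([1.0, -2.0, 0.5, 3.0, 1.5])

new_a = matvec(Q, pad(old_a, 2))
new_b = matvec(Q, pad(old_b, 2))

# Inner products, and therefore norms and angles, are unchanged by Q.
assert math.isclose(dot(old_a, old_b), dot(new_a, new_b), abs_tol=1e-9)
assert math.isclose(dot(old_a, old_a), dot(new_a, new_a), abs_tol=1e-9)
```

Because zero-padding and an orthogonal map both leave inner products unchanged, any similarity ranking computed on the old features is the same after the transformation, which is the geometric intuition behind comparing features across model updates.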

Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

Jul 03, 2024

Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

Abstract: Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as small datasets, inconsistent annotations, inaccessible model weights, and results that cannot be directly compared due to varying train/test splits and metrics. To address these issues, we aim to standardize annotations across datasets, introduce a variety of comic styles into the datasets, and establish benchmark results with clear, replicable settings. Our proposed Comics Datasets Framework standardizes dataset annotations into a common format and addresses the overrepresentation of manga by introducing Comics100, a curated collection of 100 books from the Digital Comics Museum, annotated for detection in our uniform format. We have benchmarked a variety of detection architectures using the Comics Datasets Framework. All related code, model weights, and detailed evaluation processes are available at https://github.com/emanuelevivoli/cdf, ensuring transparency and facilitating replication. This initiative is a significant advancement towards improving object detection in comics, laying the groundwork for more complex computational tasks dependent on precise object recognition.

* Accepted at MANPU - COMICS workshop at ICDAR

Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements

May 04, 2024

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

Abstract: Learning compatible representations enables the interchangeable use of semantic features as models are updated over time. This is particularly relevant in search and retrieval systems where it is crucial to avoid reprocessing of the gallery images with the updated model. While recent research has shown promising empirical evidence, there is still a lack of comprehensive theoretical understanding about learning compatible representations. In this paper, we demonstrate that the stationary representations learned by the $d$-Simplex fixed classifier optimally approximate compatibility representation according to the two inequality constraints of its formal definition. This not only establishes a solid foundation for future works in this line of research but also presents implications that can be exploited in practical learning scenarios. An exemplary application is the now-standard practice of downloading and fine-tuning new pre-trained models. Specifically, we show the strengths and critical issues of stationary representations in the case in which a model undergoing sequential fine-tuning is asynchronously replaced by downloading a better-performing model pre-trained elsewhere. Such a representation enables seamless delivery of retrieval service (i.e., no reprocessing of gallery images) and offers improved performance without operational disruptions during model replacement. Code available at: https://github.com/miccunifi/iamcl2r.

* Accepted at CVPR24 as Poster Highlight
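
The abstract above refers to a $d$-Simplex fixed classifier, that is, a classifier whose class prototypes sit at the vertices of a regular simplex and are not updated during training. As a minimal sketch of that geometry, assuming one common coordinate construction (centered and normalized one-hot vectors, embedded in K dimensions rather than d = K - 1), the code below builds K prototypes and exposes their pairwise cosines. The function names are hypothetical, and the authors' repository may construct the prototypes differently.

```python
import math

def simplex_prototypes(num_classes):
    """Unit-norm vertices of a regular simplex, embedded in R^num_classes.

    Each vertex is a one-hot vector minus the centroid, then normalized.
    Requires num_classes >= 2.
    """
    k = num_classes
    protos = []
    for i in range(k):
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        protos.append([x / norm for x in v])
    return protos

def cosine(x, y):
    """Cosine similarity; inputs are unit-norm, so this is the dot product."""
    return sum(a * b for a, b in zip(x, y))

num_classes = 4
protos = simplex_prototypes(num_classes)

# Every pair of distinct prototypes has the same cosine, -1 / (K - 1).
pairwise = [cosine(protos[i], protos[j])
            for i in range(num_classes)
            for j in range(i + 1, num_classes)]
```

All prototypes are equally and maximally separated, so keeping them fixed gives every class a position in feature space that does not move as the model is retrained, which is the sense in which the learned representation is "stationary".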
