Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Efi Kokiopoulou

Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Nov 17, 2025

Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou, Pascal Frossard

Figure 1 for Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Figure 2 for Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Figure 3 for Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Figure 4 for Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Abstract:Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.

Via

Access Paper or Ask Questions

Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation

Oct 31, 2025

Riccardo Brioschi, Aleksandr Alekseev, Emanuele Nevali, Berkay Döner, Omar El Malki, Blagoj Mitrevski, Leandro Kieliger, Mark Collier, Andrii Maksai, Jesse Berent(+2 more)

Abstract:Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.

* 15 pages, 18 figures, GitHub link: https://github.com/google-deepmind/sketch_to_layout, accept at ICCV 2025 Workshop (HiGen)

Via

Access Paper or Ask Questions

Representing Online Handwriting for Recognition in Large Vision-Language Models

Feb 23, 2024

Anastasiia Fadeeva, Philippe Schlattner, Andrii Maksai, Mark Collier, Efi Kokiopoulou, Jesse Berent, Claudiu Musat

Figure 1 for Representing Online Handwriting for Recognition in Large Vision-Language Models

Figure 2 for Representing Online Handwriting for Recognition in Large Vision-Language Models

Figure 3 for Representing Online Handwriting for Recognition in Large Vision-Language Models

Figure 4 for Representing Online Handwriting for Recognition in Large Vision-Language Models

Abstract:The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that includes both a time-ordered sequence of strokes as text, and as image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.

Via

Access Paper or Ask Questions

Pi-DUAL: Using Privileged Information to Distinguish Clean from Noisy Labels

Oct 10, 2023

Ke Wang, Guillermo Ortiz-Jimenez, Rodolphe Jenatton, Mark Collier, Efi Kokiopoulou, Pascal Frossard

Abstract:Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) -- information available only during training but not at test time -- has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.

Via

Access Paper or Ask Questions

Three Towers: Flexible Contrastive Learning with Pretrained Image Models

May 29, 2023

Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Steiner, Jesse Berent, Rodolphe Jenatton, Efi Kokiopoulou

Figure 1 for Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Figure 2 for Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Figure 3 for Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Figure 4 for Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Abstract:We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits of contrastively training the image tower. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.

Via

Access Paper or Ask Questions

SmartChoices: Augmenting Software with Learned Implementations

Apr 12, 2023

Daniel Golovin, Gabor Bartok, Eric Chen, Emily Donahue, Tzu-Kuo Huang, Efi Kokiopoulou, Ruoyan Qin, Nikhil Sarda, Justin Sybrandt, Vincent Tjeng

Figure 1 for SmartChoices: Augmenting Software with Learned Implementations

Figure 2 for SmartChoices: Augmenting Software with Learned Implementations

Figure 3 for SmartChoices: Augmenting Software with Learned Implementations

Figure 4 for SmartChoices: Augmenting Software with Learned Implementations

Abstract:We are living in a golden age of machine learning. Powerful models are being trained to perform many tasks far better than is possible using traditional software engineering approaches alone. However, developing and deploying those models in existing software systems remains difficult. In this paper we present SmartChoices, a novel approach to incorporating machine learning into mature software stacks easily, safely, and effectively. We explain the overall design philosophy and present case studies using SmartChoices within large scale industrial systems.

Via

Access Paper or Ask Questions

Transfer and Marginalize: Explaining Away Label Noise with Privileged Information

Feb 18, 2022

Mark Collier, Rodolphe Jenatton, Efi Kokiopoulou, Jesse Berent

Figure 1 for Transfer and Marginalize: Explaining Away Label Noise with Privileged Information

Figure 2 for Transfer and Marginalize: Explaining Away Label Noise with Privileged Information

Figure 3 for Transfer and Marginalize: Explaining Away Label Noise with Privileged Information

Figure 4 for Transfer and Marginalize: Explaining Away Label Noise with Privileged Information

Abstract:Supervised learning datasets often have privileged information, in the form of features which are available at training time but are not available at test time e.g. the ID of the annotator that provided the label. We argue that privileged information is useful for explaining away label noise, thereby reducing the harmful impact of noisy labels. We develop a simple and efficient method for supervised neural networks: it transfers via weight sharing the knowledge learned with privileged information and approximately marginalizes over privileged information at test time. Our method, TRAM (TRansfer and Marginalize), has minimal training time overhead and has the same test time cost as not using privileged information. TRAM performs strongly on CIFAR-10H, ImageNet and Civil Comments benchmarks.

Via

Access Paper or Ask Questions

Correlated Input-Dependent Label Noise in Large-Scale Image Classification

May 19, 2021

Mark Collier, Basil Mustafa, Efi Kokiopoulou, Rodolphe Jenatton, Jesse Berent

Figure 1 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Figure 2 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Figure 3 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Figure 4 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Abstract:Large scale image classification datasets often contain noisy labels. We take a principled probabilistic approach to modelling input-dependent, also known as heteroscedastic, label noise in these datasets. We place a multivariate Normal distributed latent variable on the final hidden layer of a neural network classifier. The covariance matrix of this latent variable, models the aleatoric uncertainty due to label noise. We demonstrate that the learned covariance structure captures known sources of label noise between semantically similar and co-occurring classes. Compared to standard neural network training and other baselines, we show significantly improved accuracy on Imagenet ILSVRC 2012 79.3% (+2.6%), Imagenet-21k 47.0% (+1.1%) and JFT 64.7% (+1.6%). We set a new state-of-the-art result on WebVision 1.0 with 76.6% top-1 accuracy. These datasets range from over 1M to over 300M training examples and from 1k classes to more than 21k classes. Our method is simple to use, and we provide an implementation that is a drop-in replacement for the final fully-connected layer in a deep classifier.

* Accepted as Oral at CVPR 2021

Via

Access Paper or Ask Questions

Routing Networks with Co-training for Continual Learning

Sep 09, 2020

Mark Collier, Efi Kokiopoulou, Andrea Gesmundo, Jesse Berent

Figure 1 for Routing Networks with Co-training for Continual Learning

Figure 2 for Routing Networks with Co-training for Continual Learning

Figure 3 for Routing Networks with Co-training for Continual Learning

Figure 4 for Routing Networks with Co-training for Continual Learning

Abstract:The core challenge with continual learning is catastrophic forgetting, the phenomenon that when neural networks are trained on a sequence of tasks they rapidly forget previously learned tasks. It has been observed that catastrophic forgetting is most severe when tasks are dissimilar to each other. We propose the use of sparse routing networks for continual learning. For each input, these network architectures activate a different path through a network of experts. Routing networks have been shown to learn to route similar tasks to overlapping sets of experts and dissimilar tasks to disjoint sets of experts. In the continual learning context this behaviour is desirable as it minimizes interference between dissimilar tasks while allowing positive transfer between related tasks. In practice, we find it is necessary to develop a new training method for routing networks, which we call co-training which avoids poorly initialized experts when new tasks are presented. When combined with a small episodic memory replay buffer, sparse routing networks with co-training outperform densely connected networks on the MNIST-Permutations and MNIST-Rotations benchmarks.

* Presented at ICML Workshop on Continual Learning 2020

Via

Access Paper or Ask Questions

Analysis of Softmax Approximation for Deep Classifiers under Input-Dependent Label Noise

Mar 15, 2020

Mark Collier, Basil Mustafa, Efi Kokiopoulou, Jesse Berent

Figure 1 for Analysis of Softmax Approximation for Deep Classifiers under Input-Dependent Label Noise

Figure 2 for Analysis of Softmax Approximation for Deep Classifiers under Input-Dependent Label Noise

Figure 3 for Analysis of Softmax Approximation for Deep Classifiers under Input-Dependent Label Noise

Figure 4 for Analysis of Softmax Approximation for Deep Classifiers under Input-Dependent Label Noise

Abstract:Modelling uncertainty arising from input-dependent label noise is an increasingly important problem. A state-of-the-art approach for classification [Kendall and Gal, 2017] places a normal distribution over the softmax logits, where the mean and variance of this distribution are learned functions of the inputs. This approach achieves impressive empirical performance but lacks theoretical justification. We show that this model is a special case of a well known and theoretically understood model studied in econometrics. Under this view the softmax over the logit distribution is a smooth approximation to an argmax, where the approximation is exact in the zero temperature limit. We further illustrate that the softmax temperature controls a bias-variance trade-off and the optimal point on this trade-off is not always found at 1.0. By tuning the softmax temperature, we achieve improved performance on well known image classification benchmarks with controlled label noise. For image segmentation, where input-dependent label noise naturally arises, we show that tuning the temperature increases the mean IoU on the PASCAL VOC and Cityscapes datasets by more than 1% over the state-of-the-art model and a strong baseline that does not model this noise source.

Via

Access Paper or Ask Questions