Claudiu Musat

Swisscom AG: Data Analytics & AI

InkFM: A Foundational Model for Full-Page Online Handwritten Note Understanding

Mar 29, 2025

Anastasiia Fadeeva, Vincent Coriou, Diego Antognini, Claudiu Musat, Andrii Maksai

Abstract: Tablets and styluses are increasingly popular for taking notes. To optimize this experience and ensure a smooth and efficient workflow, it's important to develop methods for accurately interpreting and understanding the content of handwritten digital notes. We introduce a foundational model called InkFM for analyzing full pages of handwritten content. Trained on a diverse mixture of tasks, this model offers a unique combination of capabilities: recognizing text in 28 different scripts, recognizing mathematical expressions, and segmenting pages into distinct elements like text and drawings. Our results demonstrate that these tasks can be effectively unified within a single model, achieving state-of-the-art out-of-the-box text line segmentation quality that surpasses public baselines like docTR. Fine-tuning or LoRA-tuning our base model on public datasets further improves the quality of page segmentation and achieves state-of-the-art text recognition (DeepWriting, CASIA, SCUT, and Mathwriting datasets) and sketch classification (QuickDraw). This adaptability of InkFM provides a powerful starting point for developing applications with handwritten input.

Representing Online Handwriting for Recognition in Large Vision-Language Models

Feb 23, 2024

Anastasiia Fadeeva, Philippe Schlattner, Andrii Maksai, Mark Collier, Efi Kokiopoulou, Jesse Berent, Claudiu Musat

Abstract: The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that encodes the ink both as a time-ordered sequence of strokes in text form and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.

InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write

Feb 21, 2024

Blagoj Mitrevski, Arina Rak, Julian Schnitzler, Chengkun Li, Andrii Maksai, Jesse Berent, Claudiu Musat

Abstract: Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority. Our work, InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as Derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, which allows a model to be trained in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain into simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered a valid tracing of the input image and 67% look like a pen trajectory traced by a human. Interactive visualizations of 100 word-level model outputs for each of the three public datasets are available in our Hugging Face space: https://huggingface.co/spaces/Derendering/Model-Output-Playground. Model release is in progress.

DSS: Synthesizing long Digital Ink using Data augmentation, Style encoding and Split generation

Nov 29, 2023

Aleksandr Timofeev, Anastasiia Fadeeva, Andrei Afonin, Claudiu Musat, Andrii Maksai

Abstract: As text generative models can give increasingly long answers, we tackle the problem of synthesizing long text in digital ink. We show that the commonly used models for this task fail to generalize to long-form data, and how this problem can be solved by augmenting the training data and by changing the model architecture and the inference procedure. These methods use a contrastive learning technique and are tailored specifically for the handwriting domain. They can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method reduces the character error rate on long-form English data by half compared to the baseline RNN and by 16% compared to the previous approach that aims at addressing the same problem. We show that all three parts of the method improve the recognizability of generated inks. In addition, we evaluate synthesized data in a human study and find that people perceive most of the generated data as real.

* Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14190, pages 217-235, Springer, Cham
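
The character error rate mentioned in the abstract above is commonly computed as the edit distance between a reference transcription and a recognizer's output, divided by the reference length. The following is a minimal sketch of that common definition; the function names are hypothetical, and the paper's exact evaluation setup (recognizer, text normalization) is not given in this listing.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from a reference prefix to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For generated ink, `hyp` would be the output of a handwriting recognizer run on the synthesized sample and `ref` the text the sample was meant to contain.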

Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation

Sep 06, 2023

Michael Jungo, Beat Wolf, Andrii Maksai, Claudiu Musat, Andreas Fischer

Abstract: On-line handwritten character segmentation is often associated with handwriting recognition. Even though recognition models include mechanisms to locate relevant positions during the recognition process, these are typically insufficient to produce a precise segmentation. Decoupling the segmentation from the recognition unlocks the potential to further utilize the result of the recognition. We specifically focus on the scenario where the transcription is known beforehand, in which case the character segmentation becomes an assignment problem between sampling points of the stylus trajectory and characters in the text. Inspired by the $k$-means clustering algorithm, we view it from the perspective of cluster assignment and present a Transformer-based architecture where each cluster is formed based on a learned character query in the Transformer decoder block. In order to assess the quality of our approach, we create character segmentation ground truths for two popular on-line handwriting datasets, IAM-OnDB and HANDS-VNOnDB, and evaluate multiple methods on them, demonstrating that our approach achieves the overall best results.

* International Conference on Document Analysis and Recognition - ICDAR 2023, pp. 98-114. Cham: Springer Nature Switzerland
* ICDAR 2023 Best Student Paper Award. Code available at https://github.com/jungomi/character-queries

Sampling and Ranking for Digital Ink Generation on a tight computational budget

Jun 02, 2023

Andrei Afonin, Andrii Maksai, Aleksandr Timofeev, Claudiu Musat

Abstract: Digital ink (online handwriting) generation has a number of potential applications for creating user-visible content, such as handwriting autocompletion, spelling correction, and beautification. Writing is personal and usually the processing is done on-device. Ink generative models thus need to produce high-quality content quickly, in a resource-constrained environment. In this work, we study ways to maximize the quality of the output of a trained digital ink generative model, while staying within an inference time budget. We use and compare the effect of multiple sampling and ranking techniques, in the first ablation study of its kind in the digital ink domain. We confirm our findings on multiple datasets (writing in English and Vietnamese, as well as mathematical formulas) using two model types and two common ink data representations. In all combinations, we report a meaningful improvement in the recognizability of the synthetic inks, in some cases more than halving the character error rate metric, and describe a way to select the optimal combination of sampling and ranking techniques for any given computational budget.
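
The general sample-then-rank idea under a time budget can be illustrated with a short sketch: draw as many candidates as the budget allows, score each, and keep the best. This is only an illustration of the generic pattern, not the specific sampling or ranking techniques compared in the paper; `sample`, `score`, and the per-call cost parameters are hypothetical placeholders.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def sample_and_rank(sample: Callable[[], T],
                    score: Callable[[T], float],
                    budget_ms: float,
                    sample_cost_ms: float,
                    score_cost_ms: float) -> T:
    """Draw as many candidates as fit in the budget and return the best one.

    Assumes fixed, known per-call costs (a simplification); at least one
    candidate is always generated, even if the budget is too small.
    """
    per_candidate = sample_cost_ms + score_cost_ms
    if per_candidate <= 0:
        raise ValueError("per-candidate cost must be positive")
    n = max(1, int(budget_ms // per_candidate))
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)
```

In an ink setting, `sample` might wrap one stochastic decoding pass of a generative model and `score` a recognizer-based quality estimate; both are assumptions made for illustration.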

Inkorrect: Online Handwriting Spelling Correction

Feb 28, 2022

Andrii Maksai, Henry Rowley, Jesse Berent, Claudiu Musat

Abstract: We introduce Inkorrect, a data- and label-efficient approach for online handwriting (Digital Ink) spelling correction (DISC). Unlike previous work, the proposed method does not require multiple samples from the same writer, or access to character-level segmentation. We show that existing automatic evaluation metrics do not fully capture and are not correlated with the human perception of the quality of the spelling correction, and propose new ones that correlate with human perception. We additionally surface an interesting phenomenon: a trade-off between the similarity and recognizability of the spell-corrected inks. We further create a family of models corresponding to different points on the Pareto frontier between those two axes. We show that Inkorrect's Pareto frontier dominates the points that correspond to prior work.
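
The notion of a Pareto frontier between two quality axes can be made concrete with a small sketch. It assumes each model is summarized by a (similarity, recognizability) pair and that higher is better on both axes; that convention and the function names are assumptions, since the abstract does not define the metrics.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (similarity, recognizability), higher is better


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is at least as good on both axes and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Under this definition, one method's frontier dominates a prior-work point when at least one frontier point dominates it.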

Recommending Burgers based on Pizza Preferences: Addressing Data Sparsity with a Product of Experts

Apr 26, 2021

Martin Milenkoski, Diego Antognini, Claudiu Musat

Abstract: In this paper we describe a method to tackle data sparsity and create recommendations in domains with limited knowledge about the user preferences. We expand the variational autoencoder collaborative filtering from a single-domain to a multi-domain setting. The intuition is that user-item interactions in a source domain can augment the recommendation quality in a target domain. The intuition can be taken to its extreme, where, in a cross-domain setup, the user history in a source domain is enough to generate high-quality recommendations in a target one. We thus create a Product-of-Experts (POE) architecture for recommendations that jointly models user-item interactions across multiple domains. The method is resilient to missing data for one or more of the domains, which is a situation often found in real life. We present results on two widely used datasets, Amazon and Yelp, which support the claim that holistic user preference knowledge leads to better recommendations. Surprisingly, we find that in select cases, a POE recommender that does not access the target domain user representation can surpass a strong VAE recommender baseline trained on the target domain. We complete the analysis with a study of the reasons behind this outperformance and an in-depth look at the resulting embedding spaces.

* Under review. 16 pages, 5 figures, 2 tables
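
A product of Gaussian experts has a standard closed form: precisions add, and the combined mean is the precision-weighted average of the expert means. The sketch below shows this for a single latent dimension and simply skips domains with no data, which is one way such a model can tolerate missing domains. Whether the paper uses exactly this formulation is an assumption; the function name and input layout are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, variance) for one latent dimension


def product_of_gaussians(experts: Sequence[Optional[Gaussian]]) -> Gaussian:
    """Combine per-domain Gaussian experts into one Gaussian.

    A domain with no data is passed as None and ignored (assumed handling).
    """
    present = [e for e in experts if e is not None]
    if not present:
        raise ValueError("at least one expert is required")
    precision = sum(1.0 / var for _, var in present)      # precisions add
    mean = sum(mu / var for mu, var in present) / precision  # weighted mean
    return mean, 1.0 / precision
```

Because the combined variance shrinks as experts are added, each observed domain makes the user representation more confident.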

OpenCSI: An Open-Source Dataset for Indoor Localization Using CSI-Based Fingerprinting

Apr 16, 2021

Arthur Gassner, Claudiu Musat, Alexandru Rusu, Andreas Burg

Abstract: Many applications require accurate indoor localization. Fingerprint-based localization methods propose a solution to this problem, but rely on a radio map that is effort-intensive to acquire. We automate the radio map acquisition phase using a software-defined radio (SDR) and a wheeled robot. Furthermore, we open-source a radio map acquired with our automated tool for a 3GPP Long-Term Evolution (LTE) wireless link. To the best of our knowledge, this is the first publicly available radio map containing channel state information (CSI). Finally, we describe first localization experiments on this radio map using a convolutional neural network to regress for location coordinates.

Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context

Sep 19, 2020

Milena Filipovic, Blagoj Mitrevski, Diego Antognini, Emma Lejal Glaude, Boi Faltings, Claudiu Musat

Abstract: Simulating online recommender system performance is notoriously difficult and the discrepancy between the online and offline behaviors is typically not accounted for in offline evaluations. Recommender systems research tends to evaluate model performance on randomly sampled targets, yet the same systems are later used to predict user behavior sequentially from a fixed point in time. This disparity permits weaknesses to go unnoticed until the model is deployed in a production setting. We first demonstrate how omitting temporal context when evaluating recommender system performance leads to false confidence. To overcome this, we propose an offline evaluation protocol modeling the real-life use-case that simultaneously accounts for temporal context. Next, we propose a training procedure to further embed the temporal context in existing models: we introduce it in a multi-objective approach to traditionally time-unaware recommender systems. We confirm the advantage of adding a temporal objective via the proposed evaluation protocol. Finally, on three real-world publicly available datasets, we validate that the Pareto fronts obtained with the added objective dominate those produced by state-of-the-art models that are optimized only for accuracy. The results show that including our temporal objective can improve recall@20 by up to 20%.

* Under review. 8 pages, 4 figures, 5 tables
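
The contrast between random and time-based evaluation targets, and the recall@20 metric, can be sketched as follows. The split below holds out everything at or after a fixed cutoff time instead of sampling targets at random; it is a simplified illustration, not the paper's evaluation protocol, and the function names and the (user, item, timestamp) tuple layout are assumptions.

```python
from typing import Iterable, List, Sequence, Tuple

Interaction = Tuple[str, str, int]  # (user, item, timestamp)


def temporal_split(interactions: Iterable[Interaction],
                   cutoff: int) -> Tuple[List[Interaction], List[Interaction]]:
    """Train on interactions before the cutoff, test on those at or after it."""
    train, test = [], []
    for user, item, ts in interactions:
        (train if ts < cutoff else test).append((user, item, ts))
    return train, test


def recall_at_k(ranked_items: Sequence[str],
                relevant_items: Iterable[str],
                k: int = 20) -> float:
    """Fraction of the relevant items that appear in the top-k ranking."""
    relevant = set(relevant_items)
    if not relevant:
        raise ValueError("relevant_items must be non-empty")
    hits = len(set(ranked_items[:k]) & relevant)
    return hits / len(relevant)
```

With this split, a model never sees interactions from the future of the period it is evaluated on, which mirrors how a deployed recommender is used.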
