Igor Markov

RusTitW: Russian Language Text Dataset for Visual Text in-the-Wild Recognition

Mar 29, 2023

Igor Markov, Sergey Nesteruk, Andrey Kuznetsov, Denis Dimitrov

Abstract: Information surrounds people in modern life. Text is a very efficient type of information that people have used for communication for centuries. However, automated text-in-the-wild recognition remains a challenging problem. The major limitation for a deep learning (DL) system is the lack of training data. For competitive performance, the training set must contain many samples that replicate real-world cases. While there are many high-quality datasets for English text recognition, there are no available datasets for the Russian language. In this paper, we present a large-scale human-labeled dataset for Russian text recognition in-the-wild. We also publish a synthetic dataset and code to reproduce the generation process.

* 5 pages, 6 figures, 2 tables

Federated Calibration and Evaluation of Binary Classifiers

Oct 22, 2022

Graham Cormode, Igor Markov

Abstract: We address two major obstacles to practical use of supervised classifiers on distributed private data. Whether a classifier was trained by a federation of cooperating clients or trained centrally out of distribution, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated, all without assembling labels in one place. In particular, we show how to perform calibration and compute precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models: (i) secure aggregation, (ii) distributed differential privacy, and (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.

* 24 pages
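
The idea of evaluating a classifier without assembling labels in one place can be illustrated with a minimal sketch. It is not the paper's method: it assumes each client summarizes its scores as per-label histograms and that a server only ever sees their element-wise sum, which here stands in for a secure-aggregation protocol. The bin count and all function names (`client_histograms`, `aggregate`, `precision_recall`, `roc_auc`) are hypothetical.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 10  # assumption: scores lie in [0, 1] and use equal-width bins

Hist = Tuple[List[int], List[int]]  # (positive-label counts, negative-label counts)


def client_histograms(scores: Sequence[float], labels: Sequence[int],
                      num_bins: int = NUM_BINS) -> Hist:
    """Summarize one client's data as score histograms split by true label."""
    pos = [0] * num_bins
    neg = [0] * num_bins
    for score, label in zip(scores, labels):
        b = min(int(score * num_bins), num_bins - 1)  # score 1.0 goes to last bin
        if label == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    return pos, neg


def aggregate(hists: Sequence[Hist]) -> Hist:
    """Element-wise sum over clients (placeholder for secure aggregation)."""
    pos_total = [sum(col) for col in zip(*(p for p, _ in hists))]
    neg_total = [sum(col) for col in zip(*(n for _, n in hists))]
    return pos_total, neg_total


def precision_recall(pos: List[int], neg: List[int],
                     threshold_bin: int) -> Tuple[float, float]:
    """Predict positive for every score in bin >= threshold_bin."""
    tp = sum(pos[threshold_bin:])
    fp = sum(neg[threshold_bin:])
    fn = sum(pos[:threshold_bin])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(pos: List[int], neg: List[int]) -> float:
    """ROC-AUC from histograms; a positive and negative in one bin count as half."""
    total_pos, total_neg = sum(pos), sum(neg)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    neg_below = 0
    wins = 0.0
    for p, n in zip(pos, neg):
        wins += p * (neg_below + 0.5 * n)
        neg_below += n
    return wins / (total_pos * total_neg)


# Two hypothetical clients; raw scores and labels never leave the client.
client_a = client_histograms([0.95, 0.15, 0.65], [1, 0, 1])
client_b = client_histograms([0.85, 0.35, 0.25], [1, 0, 1])
pos_hist, neg_hist = aggregate([client_a, client_b])
print(precision_recall(pos_hist, neg_hist, threshold_bin=5))
print(roc_auc(pos_hist, neg_hist))
```

Binning trades accuracy for compactness: scores that share a bin become indistinguishable, so the AUC is only approximate when positives and negatives collide. The differential privacy models named in the abstract would additionally perturb the counts with noise, which this sketch does not model.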

Data-Driven Mitigation of Adversarial Text Perturbation

Feb 19, 2022

Rasika Bhalerao, Mohammad Al-Rubaie, Anand Bhaskar, Igor Markov

Abstract: Social networks have become an indispensable part of our lives, with billions of people producing ever-increasing amounts of text. At such scales, content policies and their enforcement become paramount. To automate moderation, questionable content is detected by Natural Language Processing (NLP) classifiers. However, high-performance classifiers are hampered by misspellings and adversarial text perturbations. In this paper, we classify intentional and unintentional adversarial text perturbation into ten types and propose a deobfuscation pipeline to make NLP models robust to such perturbations. We propose Continuous Word2Vec (CW2V), our data-driven method to learn word embeddings that ensures that perturbations of words have embeddings similar to those of the original words. We show that CW2V embeddings are generally more robust to text perturbations than embeddings based on character n-grams. Our robust classification pipeline combines deobfuscation and classification, using proposed defense methods and word embeddings to classify whether Facebook posts are requesting engagement such as likes. With our pipeline, engagement bait classification drops from 0.70 to 0.67 AUC under adversarial text perturbation, while with character n-gram-based word embedding methods downstream classification drops from 0.76 to 0.64.
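
To make the character n-gram baseline mentioned in the abstract concrete, the sketch below shows how a single-character substitution changes a word's set of character trigrams. It illustrates only the general n-gram idea and is not CW2V or the paper's pipeline. The `<` and `>` boundary markers, the choice of n = 3, and the function names `char_ngrams` and `jaccard` are assumptions made for the example.

```python
from typing import Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a word with boundary markers."""
    padded = f"<{word}>"  # assumption: mark word start and end
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = char_ngrams("like")
perturbed = char_ngrams("l1ke")  # hypothetical obfuscation: 'i' replaced by '1'
print(sorted(original & perturbed))
print(jaccard(original, perturbed))
```

In a short word, one substituted character touches most trigrams, so the two spellings share little surface overlap. That is one intuition for why embeddings built purely from character n-grams can lose accuracy under perturbation.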

Picasso: Model-free Feature Visualization

Nov 24, 2021

Binh Vu, Igor Markov

Abstract: Today, Machine Learning (ML) applications can have access to tens of thousands of features. With such feature sets, efficiently browsing and curating subsets of the most relevant features is a challenge. In this paper, we present a novel approach to visualize up to several thousand features in a single image. The image not only shows information on individual features, but also expresses feature interactions via the relative positioning of features.
