Mikhail Romanov

YaART: Yet Another ART Rendering Technology

Apr 08, 2024

Sergey Kastryulin, Artem Konev, Alexander Shishenya, Eugene Lyapustin, Artem Khurshudov, Alexander Tselousov, Nikita Vinokurov, Denis Kuznedelev, Alexander Markovich, Grigoriy Livshits(+13 more)

Abstract: In the rapidly progressing field of generative models, the development of efficient and high-fidelity text-to-image diffusion systems represents a significant frontier. This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences using Reinforcement Learning from Human Feedback (RLHF). During the development of YaART, we focus especially on the choice of model size and training dataset size, aspects that had not previously been systematically investigated for text-to-image cascaded diffusion models. In particular, we comprehensively analyze how these choices affect both the efficiency of the training process and the quality of the generated images, both of which are highly important in practice. Furthermore, we demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets, establishing a more efficient scenario for training diffusion models. From the quality perspective, YaART is consistently preferred by users over many existing state-of-the-art models.

* Prompts and additional information are available on the project page, see https://ya.ru/ai/art/paper-yaart-v1

Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data

Jun 05, 2023

Nikolay Patakin, Mikhail Romanov, Anna Vorontsova, Mikhail Artemyev, Anton Konushin

Abstract: Robotics, AR, and 3D modeling applications now draw considerable attention to single-view depth estimation (SVDE), as it allows estimating scene geometry from a single RGB image. Recent works have demonstrated that the accuracy of an SVDE method depends heavily on the diversity and volume of the training data. However, RGB-D datasets obtained via depth capturing or 3D reconstruction are typically small, synthetic datasets are not photorealistic enough, and all these datasets lack diversity. Large-scale and diverse data can be sourced from stereo images or stereo videos from the web. Being typically uncalibrated, stereo data provides disparities only up to an unknown shift (geometrically incomplete data), so stereo-trained SVDE methods cannot recover 3D geometry. It was recently shown that the distorted point clouds obtained with a stereo-trained SVDE method can be corrected with additional point cloud modules (PCM) trained separately on geometrically complete data. In contrast, we propose GP$^{2}$, a General-Purpose and Geometry-Preserving training scheme, and show that conventional SVDE models can learn correct shifts themselves without any post-processing, benefiting from stereo data even in the geometry-preserving setting. Through experiments on different dataset mixtures, we show that GP$^{2}$-trained models outperform methods relying on PCM in both accuracy and speed, and we report state-of-the-art results in general-purpose geometry-preserving SVDE. Moreover, we show that SVDE models can learn to predict geometrically correct depth even when geometrically complete data makes up only a minor part of the training set.

* CVPR 2022
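
The abstract's claim that disparities known only up to an unknown shift cannot recover 3D geometry can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a 2D (x, z) slice of a scene, a pinhole camera with unit focal length, and a hypothetical right-angle corner formed by two walls. All function names and constants are illustrative assumptions.

```python
import numpy as np

def backproject(u, inv_depth):
    """Back-project normalized pixel coordinates u to (x, z) points,
    assuming a pinhole camera with unit focal length (an assumption)."""
    z = 1.0 / inv_depth
    return np.stack([u * z, z], axis=1)

def line_direction(points):
    """Dominant direction of a 2D point set (first right singular vector)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def corner_angle_deg(u_a, d_a, u_b, d_b):
    """Acute-or-right angle, in degrees, between the two reconstructed walls."""
    dir_a = line_direction(backproject(u_a, d_a))
    dir_b = line_direction(backproject(u_b, d_b))
    cos = abs(float(dir_a @ dir_b))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

# Hypothetical scene: wall A lies on z = 2 + x, wall B on z = 2 - x.
# The walls are perpendicular and meet at (x, z) = (0, 2).
u_a = np.linspace(0.05, 0.4, 30)
u_b = np.linspace(-0.4, -0.05, 30)
d_a = (1.0 - u_a) / 2.0  # true inverse depth along wall A
d_b = (1.0 + u_b) / 2.0  # true inverse depth along wall B

angle_true = corner_angle_deg(u_a, d_a, u_b, d_b)
# Unknown scale only: the scene shrinks uniformly, angles are preserved.
angle_scaled = corner_angle_deg(u_a, 3.0 * d_a, u_b, 3.0 * d_b)
# Unknown shift: the reconstruction is distorted, the right angle is lost.
angle_shifted = corner_angle_deg(u_a, d_a + 0.2, u_b, d_b + 0.2)

print(f"true: {angle_true:.2f}, scaled: {angle_scaled:.2f}, shifted: {angle_shifted:.2f}")
```

In this toy setup the walls remain straight under the shift, but the angle between them changes, which is the kind of distortion that makes shift-ambiguous predictions unsuitable for geometry reconstruction.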

Towards General Purpose and Geometry Preserving Single-View Depth Estimation

Sep 25, 2020

Mikhail Romanov, Nikolay Patatkin, Anna Vorontsova, Anton Konushin

Abstract: Single-view depth estimation plays a crucial role in scene understanding for AR applications and 3D modelling, as it allows the geometry of a scene to be retrieved. However, this is only possible if the inverse depth estimates are unbiased, i.e. they are either absolute or Up-to-Scale (UTS). In recent years, great progress has been made in general-purpose single-view depth estimation. Nevertheless, the latest general-purpose models were trained using ranking or on Up-to-Shift-Scale (UTSS) data. As a result, they provide UTSS predictions that cannot be used to reconstruct scene geometry. In this work, we strive to build a general-purpose single-view UTS depth estimation model. Following Ranftl et al., we train our model on a mixture of datasets and test it on several previously unseen datasets. We show that our method outperforms previous state-of-the-art UTS models. We train several lightweight models following the proposed training scheme and show that our ideas are applicable to computationally efficient depth estimation.
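
The distinction between UTS and UTSS predictions can be written out explicitly. The notation below is an assumption introduced for illustration and is not taken from the paper: $d$ is the true inverse depth, $\hat{d}$ the prediction, $s$ and $t$ an unknown scale and shift, $K$ the camera intrinsics, and $p$ a homogeneous pixel coordinate.

```latex
% d: true inverse depth, \hat{d}: predicted inverse depth (assumed notation)
\begin{align}
\text{UTS:}\quad  & \hat{d} = s\,d, \qquad s > 0,\\
\text{UTSS:}\quad & \hat{d} = s\,d + t, \qquad s > 0,\ t \in \mathbb{R}.
\end{align}
% Back-projecting pixel p = (u, v, 1)^T with intrinsics K:
\begin{align}
X = \frac{K^{-1} p}{d}, \qquad
\hat{X}_{\text{UTS}} = \frac{K^{-1} p}{s\,d} = \frac{1}{s}\,X, \qquad
\hat{X}_{\text{UTSS}} = \frac{K^{-1} p}{s\,d + t} = \frac{d}{s\,d + t}\,X.
\end{align}
```

Under these assumptions, a UTS prediction rescales every 3D point by the same factor $1/s$, so the scene shape is preserved. With a nonzero shift $t$, the factor $d/(s\,d + t)$ varies from pixel to pixel, so the reconstructed point cloud is distorted rather than uniformly rescaled.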

Learning High-Resolution Domain-Specific Representations with a GAN Generator

Jun 18, 2020

Danil Galeev, Konstantin Sofiiuk, Danila Rukhovich, Mikhail Romanov, Olga Barinova, Anton Konushin

Abstract: In recent years, generative models of visual data have made great progress, and they are now able to produce images of high quality and diversity. In this work, we study representations learnt by a GAN generator. First, we show that these representations can be easily projected onto a semantic segmentation map using a lightweight decoder. We find that such a semantic projection can be learnt from just a few annotated images. Based on this finding, we propose the LayerMatch scheme for approximating the representation of a GAN generator, which can be used for unsupervised domain-specific pretraining. We consider the semi-supervised learning scenario in which a small amount of labeled data is available along with a large unlabeled dataset from the same domain. We find that using a LayerMatch-pretrained backbone leads to superior accuracy compared to standard supervised pretraining on ImageNet. Moreover, this simple approach also outperforms recent semi-supervised semantic segmentation methods that use both labeled and unlabeled data during training. Source code for reproducing our experiments will be available at the time of publication.

Double Refinement Network for Efficient Indoor Monocular Depth Estimation

Nov 20, 2018

Nikita Durasov, Mikhail Romanov, Valeriya Bubnova, Anton Konushin

Abstract: Monocular depth estimation is an important computer vision problem that can now be addressed with neural networks and deep learning. Although recent works in this area have shown significant improvements in accuracy, state-of-the-art methods require large amounts of memory and time. The main purpose of this paper is to improve the performance of the latest solutions with no decrease in accuracy. To achieve this, we propose a Double Refinement Network architecture. We evaluate the results on the standard RGB-D benchmark dataset NYU Depth v2. The results are equal to the current state of the art, while the frames-per-second rate of our approach is significantly higher (up to a 15-times speedup per image with batch size 1) and RAM usage per image is significantly lower.
