Abstract:Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
Abstract:Score-based generative models (SGMs) is a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of SGMs in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our result to several concrete SGMs with specific choices of forward processes modelled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment SGMs with different forward processes, some of which are newly proposed in this paper, for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity, and the models with our newly proposed forward processes can outperform existing models.
Abstract:Shape estimation of sweetpotato (SP) storage roots is inherently challenging due to their varied size and shape characteristics. Even measuring "simple" metrics, such as length and width, requires significant time investments either directly in-field or afterward using automated graders. In this paper, we present the results of a model that can perform grading and provide yield estimates directly in the field quicker than manual measurements. Detectron2, a library consisting of deep-learning object detection algorithms, was used to implement Mask R-CNN, an instance segmentation model. This model was deployed for in-field grade estimation of SPs and evaluated against an optical sorter. Storage roots from various clones imaged with a cellphone during trials between 2019 and 2020, were used in the model's training and validation to fine-tune a model to detect SPs. Our results showed that the model could distinguish individual SPs in various environmental conditions including variations in lighting and soil characteristics. RMSE for length, width, and weight, from the model compared to a commercial optical sorter, were 0.66 cm, 1.22 cm, and 74.73 g, respectively, while the RMSE of root counts per plot was 5.27 roots, with r^2 = 0.8. This phenotyping strategy has the potential enable rapid yield estimates in the field without the need for sophisticated and costly optical sorters and may be more readily deployed in environments with limited access to these kinds of resources or facilities.