Jake
Abstract:The quality control of printed circuit boards (PCBs) is paramount in advancing electronic device technology. While numerous machine learning methodologies have been utilized to augment defect detection efficiency and accuracy, previous studies have predominantly focused on optimizing individual models for specific defect types, often overlooking the potential synergies between different approaches. This paper introduces a comprehensive inspection framework leveraging an ensemble learning strategy to address this gap. Initially, we utilize four distinct PCB defect detection models utilizing state-of-the-art methods: EfficientDet, MobileNet SSDv2, Faster RCNN, and YOLOv5. Each method is capable of identifying PCB defects independently. Subsequently, we integrate these models into an ensemble learning framework to enhance detection performance. A comparative analysis reveals that our ensemble learning framework significantly outperforms individual methods, achieving a 95% accuracy in detecting diverse PCB defects. These findings underscore the efficacy of our proposed ensemble learning framework in enhancing PCB quality control processes.
Abstract:Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot's performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at https://github.com/RayYoh/OCRM_survey.
Abstract:Text-to-image generation models have seen considerable advancement, catering to the increasing interest in personalized image creation. Current customization techniques often necessitate users to provide multiple images (typically 3-5) for each customized object, along with the classification of these objects and descriptive textual prompts for scenes. This paper questions whether the process can be made more user-friendly and the customization more intricate. We propose a method where users need only provide images along with text for each customization topic, and necessitates only a single image per visual concept. We introduce the concept of a ``multi-modal prompt'', a novel integration of text and images tailored to each customization concept, which simplifies user interaction and facilitates precise customization of both objects and scenes. Our proposed paradigm for customized text-to-image generation surpasses existing finetune-based methods in user-friendliness and the ability to customize complex objects with user-friendly inputs. Our code is available at $\href{https://github.com/zhongzero/Multi-Modal-Prompt}{https://github.com/zhongzero/Multi-Modal-Prompt}$.
Abstract:Image inpainting, the task of reconstructing missing segments in corrupted images using available data, faces challenges in ensuring consistency and fidelity, especially under information-scarce conditions. Traditional evaluation methods, heavily dependent on the existence of unmasked reference images, inherently favor certain inpainting outcomes, introducing biases. Addressing this issue, we introduce an innovative evaluation paradigm that utilizes a self-supervised metric based on multiple re-inpainting passes. This approach, diverging from conventional reliance on direct comparisons in pixel or feature space with original images, emphasizes the principle of self-consistency to enable the exploration of various viable inpainting solutions, effectively reducing biases. Our extensive experiments across numerous benchmarks validate the alignment of our evaluation method with human judgment.
Abstract:Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion models on text-to-image generation capabilities and trades computation during inference time for sample quality. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Comparing to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Our code is available at https://rlcm.owenoertell.com
Abstract:Plant leaf identification is crucial for biodiversity protection and conservation and has gradually attracted the attention of academia in recent years. Due to the high similarity among different varieties, leaf cultivar recognition is also considered to be an ultra-fine-grained visual classification (UFGVC) task, which is facing a huge challenge. In practice, an instance may be related to multiple varieties to varying degrees, especially in the UFGVC datasets. However, deep learning methods trained on one-hot labels fail to reflect patterns shared across categories and thus perform poorly on this task. To address this issue, we generate soft targets integrated with inter-class similarity information. Specifically, we continuously update the prototypical features for each category and then capture the similarity scores between instances and prototypes accordingly. Original one-hot labels and the similarity scores are incorporated to yield enhanced labels. Prototype-enhanced soft labels not only contain original one-hot label information, but also introduce rich inter-category semantic association information, thus providing more effective supervision for deep model training. Extensive experimental results on public datasets show that our method can significantly improve the performance on the UFGVC task of leaf cultivar identification.
Abstract:Cross-domain few-shot learning (CDFSL) remains a largely unsolved problem in the area of computer vision, while self-supervised learning presents a promising solution. Both learning methods attempt to alleviate the dependency of deep networks on the requirement of large-scale labeled data. Although self-supervised methods have recently advanced dramatically, their utility on CDFSL is relatively unexplored. In this paper, we investigate the role of self-supervised representation learning in the context of CDFSL via a thorough evaluation of existing methods. It comes as a surprise that even with shallow architectures or small training datasets, self-supervised methods can perform favorably compared to the existing SOTA methods. Nevertheless, no single self-supervised approach dominates all datasets indicating that existing self-supervised methods are not universally applicable. In addition, we find that representations extracted from self-supervised methods exhibit stronger robustness than the supervised method. Intriguingly, whether self-supervised representations perform well on the source domain has little correlation with their applicability on the target domain. As part of our study, we conduct an objective measurement of the performance for six kinds of representative classifiers. The results suggest Prototypical Classifier as the standard evaluation recipe for CDFSL.
Abstract:Music arrangement generation is a subtask of automatic music generation, which involves reconstructing and re-conceptualizing a piece with new compositional techniques. Such a generation process inevitably requires reference from the original melody, chord progression, or other structural information. Despite some promising models for arrangement, they lack more refined data to achieve better evaluations and more practical results. In this paper, we propose POP909, a dataset which contains multiple versions of the piano arrangements of 909 popular songs created by professional musicians. The main body of the dataset contains the vocal melody, the lead instrument melody, and the piano accompaniment for each song in MIDI format, which are aligned to the original audio files. Furthermore, we provide the annotations of tempo, beat, key, and chords, where the tempo curves are hand-labeled and others are done by MIR algorithms. Finally, we conduct several baseline experiments with this dataset using standard deep music generation algorithms.
Abstract:The dominant approach for music representation learning involves the deep unsupervised model family variational autoencoder (VAE). However, most, if not all, viable attempts on this problem have largely been limited to monophonic music. Normally composed of richer modality and more complex musical structures, the polyphonic counterpart has yet to be addressed in the context of music representation learning. In this work, we propose the PianoTree VAE, a novel tree-structure extension upon VAE aiming to fit the polyphonic music learning. The experiments prove the validity of the PianoTree VAE via (i)-semantically meaningful latent code for polyphonic segments; (ii)-more satisfiable reconstruction aside of decent geometry learned in the latent space; (iii)-this model's benefits to the variety of the downstream music generation.
Abstract:Static image action recognition, which aims to recognize action based on a single image, usually relies on expensive human labeling effort such as adequate labeled action images and large-scale labeled image dataset. In contrast, abundant unlabeled videos can be economically obtained. Therefore, several works have explored using unlabeled videos to facilitate image action recognition, which can be categorized into the following two groups: (a) enhance visual representations of action images with a designed proxy task on unlabeled videos, which falls into the scope of self-supervised learning; (b) generate auxiliary representations for action images with the generator learned from unlabeled videos. In this paper, we integrate the above two strategies in a unified framework, which consists of Visual Representation Enhancement (VRE) module and Motion Representation Augmentation (MRA) module. Specifically, the VRE module includes a proxy task which imposes pseudo motion label constraint and temporal coherence constraint on unlabeled videos, while the MRA module could predict the motion information of a static action image by exploiting unlabeled videos. We demonstrate the superiority of our framework based on four benchmark human action datasets with limited labeled data.