Abstract:Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot's performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at https://github.com/RayYoh/OCRM_survey.
Abstract:Plant leaf identification is crucial for biodiversity protection and conservation and has gradually attracted the attention of academia in recent years. Due to the high similarity among different varieties, leaf cultivar recognition is also considered to be an ultra-fine-grained visual classification (UFGVC) task, which is facing a huge challenge. In practice, an instance may be related to multiple varieties to varying degrees, especially in the UFGVC datasets. However, deep learning methods trained on one-hot labels fail to reflect patterns shared across categories and thus perform poorly on this task. To address this issue, we generate soft targets integrated with inter-class similarity information. Specifically, we continuously update the prototypical features for each category and then capture the similarity scores between instances and prototypes accordingly. Original one-hot labels and the similarity scores are incorporated to yield enhanced labels. Prototype-enhanced soft labels not only contain original one-hot label information, but also introduce rich inter-category semantic association information, thus providing more effective supervision for deep model training. Extensive experimental results on public datasets show that our method can significantly improve the performance on the UFGVC task of leaf cultivar identification.
Abstract:Cross-domain few-shot learning (CDFSL) remains a largely unsolved problem in the area of computer vision, while self-supervised learning presents a promising solution. Both learning methods attempt to alleviate the dependency of deep networks on the requirement of large-scale labeled data. Although self-supervised methods have recently advanced dramatically, their utility on CDFSL is relatively unexplored. In this paper, we investigate the role of self-supervised representation learning in the context of CDFSL via a thorough evaluation of existing methods. It comes as a surprise that even with shallow architectures or small training datasets, self-supervised methods can perform favorably compared to the existing SOTA methods. Nevertheless, no single self-supervised approach dominates all datasets indicating that existing self-supervised methods are not universally applicable. In addition, we find that representations extracted from self-supervised methods exhibit stronger robustness than the supervised method. Intriguingly, whether self-supervised representations perform well on the source domain has little correlation with their applicability on the target domain. As part of our study, we conduct an objective measurement of the performance for six kinds of representative classifiers. The results suggest Prototypical Classifier as the standard evaluation recipe for CDFSL.
Abstract:Merchandise categories inherently form a semantic hierarchy with different levels of concept abstraction, especially for fine-grained categories. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make predictions less ambiguous. However, previous studies of fine-grained image retrieval primarily focus on semantic similarities or visual similarities. In a real application, merely using visual similarity may not satisfy the need of consumers to search merchandise with real-life images, e.g., given a red coat as a query image, we might get a red suit in recall results only based on visual similarity since they are visually similar. But the users actually want a coat rather than suit even the coat is with different color or texture attributes. We introduce this new problem based on photoshopping in real practice. That's why semantic information are integrated to regularize the margins to make "semantic" prior to "visual". To solve this new problem, we propose a hierarchical adaptive semantic-visual tree (ASVT) to depict the architecture of merchandise categories, which evaluates semantic similarities between different semantic levels and visual similarities within the same semantic class simultaneously. The semantic information satisfies the demand of consumers for similar merchandise with the query while the visual information optimizes the correlations within the semantic class. At each level, we set different margins based on the semantic hierarchy and incorporate them as prior information to learn a fine-grained feature embedding. To evaluate our framework, we propose a new dataset named JDProduct, with hierarchical labels collected from actual image queries and official merchandise images on an online shopping application. Extensive experimental results on the public CARS196 and CUB-
Abstract:Sketch recognition remains a significant challenge due to the limited training data and the substantial intra-class variance of freehand sketches for the same object. Conventional methods for this task often rely on the availability of the temporal order of sketch strokes, additional cues acquired from different modalities and supervised augmentation of sketch datasets with real images, which also limit the applicability and feasibility of these methods in real scenarios. In this paper, we propose a novel sketch-specific data augmentation (SSDA) method that leverages the quantity and quality of the sketches automatically. From the aspect of quantity, we introduce a Bezier pivot based deformation (BPD) strategy to enrich the training data. Towards quality improvement, we present a mean stroke reconstruction (MSR) approach to generate a set of novel types of sketches with smaller intra-class variances. Both of these solutions are unrestricted from any multi-source data and temporal cues of sketches. Furthermore, we show that some recent deep convolutional neural network models that are trained on generic classes of real images can be better choices than most of the elaborate architectures that are designed explicitly for sketch recognition. As SSDA can be integrated with any convolutional neural networks, it has a distinct advantage over the existing methods. Our extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art results (84.27%) on the TU-Berlin dataset, outperforming the human performance by a remarkable 11.17% increase. We also present a new benchmark named Sketchy-R to facilitate future research in sketch recognition. Finally, more experiments show the practical value of our approach to the task of sketch-based image retrieval.
Abstract:In this paper, we propose a novel deep framework for part-level semantic parsing of freehand sketches, which makes three main contributions that are experimentally shown to have substantial practical merit. First, we introduce a new idea named homogeneous transformation to address the problem of domain adaptation. For the task of sketch parsing, there is no available data of labeled freehand sketches that can be directly used for model training. An alternative solution is to learn from the existing parsing data of real images, while the domain adaptation is an inevitable problem. Unlike existing methods that utilize the edge maps of real images to approximate freehand sketches, the proposed homogeneous transformation method transforms the data from two different domains into a homogeneous space to minimize the semantic gap. Second, we design a soft-weighted loss function as guidance for the training process, which gives attention to both the ambiguous label boundary and class imbalance. Third, we present a staged learning strategy to improve the parsing performance of the trained model, which takes advantage of the shared information and specific characteristic from different sketch categories. Extensive experimental results demonstrate the effectiveness of these methods. Specifically, to evaluate the generalization ability of our homogeneous transformation method, additional experiments at the task of sketch-based image retrieval are conducted on the QMUL FG-SBIR dataset. By integrating the proposed three methods into a unified framework, our final deep semantic sketch parsing (DeepSSP) model achieves the state-of-the-art on the public SketchParse dataset.