Abstract:Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing reward and the high costs of online interaction. However, since labeling preference needs real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes a offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of reward model, where only high confidence and low variance data are selected. Moreover, we provide the generalization bound of reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has theoretical improvement guarantee. The developed theory is based on state-action pair, which can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve comparable performance to baseline under fewer preference data without online interaction.
Abstract:Accurate recognition of human motion intention (HMI) is beneficial for exoskeleton robots to improve the wearing comfort level and achieve natural human-robot interaction. A classifier trained on labeled source subjects (domains) performs poorly on unlabeled target subject since the difference in individual motor characteristics. The unsupervised domain adaptation (UDA) method has become an effective way to this problem. However, the labeled data are collected from multiple source subjects that might be different not only from the target subject but also from each other. The current UDA methods for HMI recognition ignore the difference between each source subject, which reduces the classification accuracy. Therefore, this paper considers the differences between source subjects and develops a novel theory and algorithm for UDA to recognize HMI, where the margin disparity discrepancy (MDD) is extended to multi-source UDA theory and a novel weight-aware-based multi-source UDA algorithm (WMDD) is proposed. The source domain weight, which can be adjusted adaptively by the MDD between each source subject and target subject, is incorporated into UDA to measure the differences between source subjects. The developed multi-source UDA theory is theoretical and the generalization error on target subject is guaranteed. The theory can be transformed into an optimization problem for UDA, successfully bridging the gap between theory and algorithm. Moreover, a lightweight network is employed to guarantee the real-time of classification and the adversarial learning between feature generator and ensemble classifiers is utilized to further improve the generalization ability. The extensive experiments verify theoretical analysis and show that WMDD outperforms previous UDA methods on HMI recognition tasks.