Facial expression manipulation, as an image-to-image translation problem, aims at editing facial expression with a given condition. Previous methods edit an input image under the guidance of a discrete emotion label or absolute condition (e.g., facial action units) to possess the desired expression. However, these methods either suffer from changing condition-irrelevant regions or are inefficient to preserve image quality. In this study, we take these two objectives into consideration and propose a novel conditional GAN model. First, we replace continuous absolute condition with relative condition, specifically, relative action units. With relative action units, the generator learns to only transform regions of interest which are specified by non-zero-valued relative AUs, avoiding estimating the current AUs of input image. Second, our generator is built on U-Net architecture and strengthened by multi-scale feature fusion (MSF) mechanism for high-quality expression editing purpose. Extensive experiments on both quantitative and qualitative evaluation demonstrate the improvements of our proposed approach compared with the state-of-the-art expression editing methods.