Recent learning-based approaches show promising performance improvement for scene text removal task. However, these methods usually leave some remnants of text and obtain visually unpleasant results. In this work, we propose a novel "end-to-end" framework based on accurate text stroke detection. Specifically, we decouple the text removal problem into text stroke detection and stroke removal. We design a text stroke detection network and a text removal generation network to solve these two sub-problems separately. Then, we combine these two networks as a processing unit, and cascade this unit to obtain the final model for text removal. Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art approaches for locating and erasing scene text. Since current publicly available datasets are all synthetic and cannot properly measure the performance of different methods, we therefore construct a new real-world dataset, which will be released to facilitate the relevant research.