In this work, we propose a novel staged depthwise correlation and feature fusion network, named DCFFNet, to further optimize the feature extraction for visual tracking. We build our deep tracker upon a siamese network architecture, which is offline trained from scratch on multiple large-scale datasets in an end-to-end manner. The model contains a core component, that is, depthwise correlation and feature fusion module (correlation-fusion module), which facilitates model to learn a set of optimal weights for a specific object by utilizing ensembles of multi-level features from lower and higher layers and multi-channel semantics on the same layer. We combine the modified ResNet-50 with the proposed correlation-fusion layer to constitute the feature extractor of our model. In training process, we find the training of model become more stable, that benifits from the correlation-fusion module. For comprehensive evaluations of performance, we implement our tracker on the popular benchmarks, including OTB100, VOT2018 and LaSOT. Extensive experiment results demonstrate that our proposed method achieves favorably competitive performance against many leading trackers in terms of accuracy and precision, while satisfying the real-time requirements of applications.