Genetic Programming (GP) is a computationally intensive technique which is naturally parallel in nature. Consequently, many attempts have been made to improve its run-time from exploiting highly parallel hardware such as GPUs. However, a second methodology of improving the speed of GP is through efficiency techniques such as subtree caching. However achieving parallel performance and efficiency is a difficult task. This paper will demonstrate an efficiency saving for GP compatible with the harnessing of parallel CPU hardware by exploiting tournament selection. Significant efficiency savings are demonstrated whilst retaining the capability of a high performance parallel implementation of GP. Indeed, a 74% improvement in the speed of GP is achieved with a peak rate of 96 billion GPop/s for classification type problems.