The enormous amount of network equipment and users implies a tremendous growth of Internet traffic for multimedia services. To mitigate the traffic pressure, architectures with in-network storage are proposed to cache popular content at nodes in close proximity to users to shorten the backhaul links. Meanwhile, the reduction of transmission distance also contributes to the energy saving. However, due to limited storage, only a fraction of the content can be cached, while caching the most popular content is cost-effective. Correspondingly, it becomes essential to devise an effective popularity prediction method. In this regard, existing efforts adopt dynamic graph neural network (DGNN) models, but it remains challenging to tackle sparse datasets. In this paper, we first propose a reformative temporal graph network, which is named STGN, that utilizes extra semantic messages to enhance the temporal and structural learning of a DGNN model, since the consideration of semantics can help establish implicit paths within the sparse interaction graph and hence improve the prediction performance. Furthermore, we propose a user-specific attention mechanism to fine-grainedly aggregate various semantics. Finally, extensive simulations verify the superiority of our STGN models and demonstrate their high potential in energy-saving.