Accurately predicting future pedestrian trajectories is crucial across various domains. Due to the uncertainty in future pedestrian trajectories, it is important to learn complex spatio-temporal representations in multi-agent scenarios. To address this, we propose a novel Cross-Correction Framework (CCF) to learn spatio-temporal representations of pedestrian trajectories better. Our framework consists of two trajectory prediction models, known as subnets, which share the same architecture and are trained with both cross-correction loss and trajectory prediction loss. Cross-correction leverages the learning from both subnets and enables them to refine their underlying representations of trajectories through a mutual correction mechanism. Specifically, we use the cross-correction loss to learn how to correct each other through an inter-subnet interaction. To induce diverse learning among the subnets, we use the transformed observed trajectories produced by a neural network as input to one subnet and the original observed trajectories as input to the other subnet. We utilize transformer-based encoder-decoder architecture for each subnet to capture motion and social interaction among pedestrians. The encoder of the transformer captures motion patterns in trajectories, while the decoder focuses on pedestrian interactions with neighbors. Each subnet performs the primary task of predicting future trajectories (a regression task) along with the secondary task of classifying the predicted trajectories (a classification task). Extensive experiments on real-world benchmark datasets such as ETH-UCY and SDD demonstrate the efficacy of our proposed framework, CCF, in precisely predicting pedestrian future trajectories. We also conducted several ablation experiments to demonstrate the effectiveness of various modules and loss functions used in our approach.