Anticipating the motion of neighboring vehicles is crucial for autonomous driving, especially on congested highways where even slight variations in motion can result in catastrophic collisions. Accurately predicting a vehicle's future trajectory relies not only on its past trajectory but also, more importantly, on modeling the complex interactions among nearby vehicles. Most state-of-the-art networks built to tackle this problem assume that past trajectory points are readily available, and therefore lack a full end-to-end pipeline that maps video directly to predictions. In this article, we propose a novel end-to-end architecture that takes raw video as input and outputs future trajectory predictions. It first extracts and tracks the 3D locations of nearby vehicles using multi-head attention-based regression networks together with non-linear optimization. The resulting past trajectory points are then fed into the trajectory prediction algorithm, an attention-based LSTM encoder-decoder architecture that models the complicated interdependence between vehicles and accurately predicts the future trajectory points of the surrounding vehicles. The proposed model is evaluated on the large-scale BLVD dataset and has also been implemented in the CARLA simulator. The experimental results demonstrate that our approach outperforms various state-of-the-art models.
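To make the interaction-modeling idea concrete, the following is a minimal sketch of how an attention step might weight neighboring vehicles when forming a context vector for one target vehicle. It assumes scaled dot-product attention, which the abstract does not specify; the function names `softmax` and `attend` and the two-dimensional vectors are hypothetical, standing in for the hidden states that an LSTM encoder would produce from past trajectory points.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, neighbor_states):
    """Scaled dot-product attention of one target state over neighbor states.

    Assumption: dot-product scoring; the actual model may score differently.
    Returns the attention weights and the weighted-sum context vector.
    """
    if not neighbor_states:
        raise ValueError("at least one neighbor state is required")
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, state)) / math.sqrt(d)
        for state in neighbor_states
    ]
    weights = softmax(scores)
    context = [
        sum(w * state[i] for w, state in zip(weights, neighbor_states))
        for i in range(d)
    ]
    return weights, context


# Hypothetical encodings: in the described pipeline these would come from
# an LSTM encoder run over each vehicle's past trajectory.
target = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = attend(target, neighbors)
```

In this toy setting, the neighbor whose encoding is most aligned with the target receives the largest weight, so it contributes most to the context vector that a decoder would consume when predicting future points.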