Accurate vision-based speed estimation is much more cost-effective than traditional methods based on radar or LiDAR. However, it is also challenging due to the limitations of perspective projection on a discrete sensor, as well as the high sensitivity to calibration, lighting and weather conditions. Interestingly, deep learning approaches (which dominate the field of computer vision) are very limited in this context due to the lack of available data. Indeed, obtaining video sequences of real road traffic with accurate speed values associated with each vehicle is very complex and costly, and the number of available datasets is very limited. Recently, some approaches are focusing on the use of synthetic data. However, it is still unclear how models trained on synthetic data can be effectively applied to real world conditions. In this work, we propose the use of digital-twins using CARLA simulator to generate a large dataset representative of a specific real-world camera. The synthetic dataset contains a large variability of vehicle types, colours, speeds, lighting and weather conditions. A 3D CNN model is trained on the digital twin and tested on the real sequences. Unlike previous approaches that generate multi-camera sequences, we found that the gap between the the real and the virtual conditions is a key factor in obtaining low speed estimation errors. Even with a preliminary approach, the mean absolute error obtained remains below 3km/h.