Meeting the high data rate demands of modern applications requires the use of high-frequency spectrum bands, including millimeter-wave and sub-terahertz bands. Communication at these frequencies, however, depends on precise alignment of narrow beams between transmitters and receivers, which typically incurs significant beam training overhead. This paper introduces a novel end-to-end vision-aided beamforming framework that uses images to predict optimal beams while accounting for geometric adjustments to reduce overhead. Our model adapts robustly to dynamic environments without relying on additional training data. Experimental results indicate a top-5 beam prediction accuracy of 98.96%, significantly surpassing current state-of-the-art solutions in vision-aided beamforming.