The literature is abundant with methodologies focusing on using transformer architectures due to their prominence in wireless signal processing and their capability to capture long-range dependencies via attention mechanisms. In particular, depthwise separable convolutions enhance parameter efficiency for the process of high-dimensional data characteristics of MIMO systems. In this work, we introduce a novel unsupervised deep learning framework that integrates depthwise separable convolutions and transformers to generate beamforming weights under imperfect channel state information (CSI) for a multi-user single-input multiple-output (MU-SIMO) system in dense urban environments. The primary goal is to enhance throughput by maximizing sum-rate while ensuring reliable communication. Spectral efficiency and block error rate (BLER) are considered as performance metrics. Experiments are carried out under various conditions to compare the performance of the proposed NNBF framework against baseline methods zero-forcing beamforming (ZFBF) and minimum mean square error (MMSE) beamforming. Experimental results demonstrate the superiority of the proposed framework over the baseline techniques.