In the dynamic realm of deepfake detection, this work presents an innovative approach to validate video content. The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is uniquely tailored to capture spatiotemporal features via sliding filters, extending through both spatial and temporal dimensions. This configuration enables nuanced pattern recognition in pixel arrangement and temporal evolution across frames. Simultaneously, the 2D model leverages EfficientNet architecture, harnessing auto-scaling in Convolutional Neural Networks. Notably, this ensemble integrates Voting Ensembles and Adaptive Weighted Ensembling. Strategic prioritization of the 3-dimensional model's output capitalizes on its exceptional spatio-temporal feature extraction. Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation's deceptive practices.