Fake News Detection (FND) is an essential field in natural language processing that aims to identify and check the truthfulness of major claims in a news article to decide the news veracity. FND finds its uses in preventing social, political and national damage caused due to misrepresentation of facts which may harm a certain section of society. Further, with the explosive rise in fake news dissemination over social media, including images and text, it has become imperative to identify fake news faster and more accurately. To solve this problem, this work investigates a novel multimodal stacked ensemble-based approach (SEMIFND) to fake news detection. Focus is also kept on ensuring faster performance with fewer parameters. Moreover, to improve multimodal performance, a deep unimodal analysis is done on the image modality to identify NasNet Mobile as the most appropriate model for the task. For text, an ensemble of BERT and ELECTRA is used. The approach was evaluated on two datasets: Twitter MediaEval and Weibo Corpus. The suggested framework offered accuracies of 85.80% and 86.83% on the Twitter and Weibo datasets respectively. These reported metrics are found to be superior when compared to similar recent works. Further, we also report a reduction in the number of parameters used in training when compared to recent relevant works. SEMI-FND offers an overall parameter reduction of at least 20% with unimodal parametric reduction on text being 60%. Therefore, based on the investigations presented, it is concluded that the application of a stacked ensembling significantly improves FND over other approaches while also improving speed.