One of the ultimate quests of question answering (QA) is to deploy a system that can answer any type of question from users and refrain from answering when it does not know the answer. While recent advances in scaling large language models (LLMs) have brought significant improvements on various QA datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. In this paper, we first provide empirical evidence that state-of-the-art LLMs such as Codex generalize poorly to question types beyond those seen in the prompt. To address this, we propose a Mixture-of-Prompt-Experts (MOPE) system that ensembles multiple specialized LLMs. We implement each specialized model with the same backbone model (Codex) but with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning. By strategically selecting the best specialized model for each given question, our MOPE system significantly outperforms any single specialized model on a collection of 12 QA datasets spanning the four reasoning types. Moreover, the attribution and agreement among specialized expert models offer greater interpretability, allowing for better selective question answering. Our human study further confirms that presenting the expert predictions and the answer selection process helps annotators more accurately decide when to trust the system's output. We release all code and data to facilitate future work.
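To make the MOPE setup concrete, the sketch below illustrates one way such a system could be wired together: the same backbone LLM is conditioned on reasoning-specific prompts, and inter-expert agreement serves as a signal for selective answering. This is a minimal illustration under stated assumptions, not the released implementation; the names (`call_backbone`, `EXPERT_PROMPTS`, `mope_answer`), the prompt texts, and the majority-agreement selection heuristic are all hypothetical, and the paper's actual answer selector may differ.

```python
# Minimal sketch of a Mixture-of-Prompt-Experts (MOPE) pipeline.
# Hypothetical names and a simple agreement-based selector; not the authors' code.
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

# Prompt prefixes specializing the same backbone LLM for different
# reasoning categories (factual, multihop, mathematical, commonsense).
EXPERT_PROMPTS: Dict[str, str] = {
    "factual": "Answer the factual question concisely.\n",
    "multihop": "Answer by reasoning over multiple supporting facts step by step.\n",
    "math": "Solve the math word problem and give the final numeric answer.\n",
    "commonsense": "Use everyday commonsense knowledge to answer.\n",
}

def mope_answer(question: str,
                call_backbone: Callable[[str], str],
                min_agreement: int = 2) -> Tuple[Optional[str], Dict[str, str]]:
    """Query every prompt-specialized expert, then select an answer or abstain."""
    # Each expert is the same backbone model conditioned on a different prompt.
    predictions = {
        name: call_backbone(prompt + "Q: " + question + "\nA:").strip()
        for name, prompt in EXPERT_PROMPTS.items()
    }
    # Agreement among experts doubles as a confidence signal for selective QA:
    # answer only when enough experts agree, otherwise abstain (return None).
    answer, votes = Counter(predictions.values()).most_common(1)[0]
    if votes >= min_agreement:
        return answer, predictions  # expert predictions kept for interpretability
    return None, predictions

if __name__ == "__main__":
    # Toy backbone that returns a canned answer, just to make the sketch runnable.
    demo_backbone = lambda prompt: "Paris"
    print(mope_answer("What is the capital of France?", demo_backbone))
```

Returning the per-expert predictions alongside the selected answer mirrors the interpretability claim in the abstract: a user (or annotator) can inspect which experts agreed before deciding whether to trust the output.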