Recent years have witnessed a growing body of research on Question Difficulty Estimation from Text (QDET) with Natural Language Processing (NLP) techniques, which aims to address the limitations of traditional approaches to question calibration. However, almost all previous research focused on single silos, without quantitative comparisons between different models or across datasets from different educational domains. In this work, we aim to fill this gap by quantitatively analyzing several approaches proposed in previous research and comparing their performance on three publicly available real-world datasets containing questions of different types from different educational domains. Specifically, we consider reading comprehension Multiple Choice Questions (MCQs), science MCQs, and math questions. We find that Transformer-based models perform best across the different educational domains, with DistilBERT performing almost as well as BERT, and that they outperform other approaches even on smaller datasets. Among the other models, hybrid ones often outperform those based on a single type of features; models based on linguistic features perform well on reading comprehension questions, while frequency-based features (TF-IDF) and word embeddings (word2vec) perform better in domain knowledge assessment.