Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests, and genetic tests over a period of many years in search of a diagnosis. Addressing this diagnostic odyssey could therefore have substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which artificial intelligence algorithms can use to facilitate clinical diagnosis, to prioritize candidate diseases for further examination by laboratory tests or genetic assays, or to aid the phenotype-driven reinterpretation of genome/exome sequencing data. However, existing methods that use frontal facial photos are built on conventional convolutional neural networks (CNNs), rely exclusively on facial images, and cannot capture the non-facial phenotypic traits and demographic information essential for guiding accurate diagnoses. Here we introduce GestaltMML, a multimodal machine learning (MML) approach based solely on the Transformer architecture. It integrates patients' facial images, demographic information (age, sex, ethnicity), and clinical notes to improve prediction accuracy. We also introduce GestaltGPT, a GPT-based methodology with few-shot learning capabilities that relies exclusively on textual inputs and uses a range of large language models (LLMs), including Llama 2, GPT-J, and Falcon. We evaluated these methods on a diverse range of datasets, including 449 diseases from the GestaltMatcher Database and several in-house datasets on Beckwith-Wiedemann syndrome, Sotos syndrome, NAA10-related syndrome (a neurodevelopmental syndrome), and others. Our results suggest that GestaltMML and GestaltGPT effectively incorporate multiple modalities of data, greatly narrow down the candidate genetic diagnoses of rare diseases, and may facilitate the reinterpretation of genome/exome sequencing data.