With the increasing capabilities of modern vehicles, novel interaction approaches have emerged that go beyond traditional touch-based and voice-command interfaces. In particular, hand gestures, head pose, eye gaze, and speech have been extensively investigated in automotive applications for object selection and referencing. Despite these advances, most existing systems follow a one-model-fits-all approach that is ill-suited to varying user behavior and individual differences. Moreover, current referencing approaches either consider these modalities separately or focus on stationary settings, whereas a moving vehicle is a highly dynamic environment subject to safety-critical constraints. In this paper, I propose a research plan for a user-centered adaptive multimodal fusion approach for referencing external objects from a moving vehicle. The proposed plan aims to provide an open-source framework for user-centered adaptation and personalization that combines user observations and heuristics, multimodal fusion, clustering, transfer learning for model adaptation, and continuous learning, moving towards trusted human-centered artificial intelligence.