In this paper, we presented a preliminary study on detecting tactical driver behaviors in untrimmed naturalistic driving recordings. Supervised detection is a common approach, but it suffers when labeled data is scarce, and manual annotation is both time-consuming and expensive. To highlight this problem, we experimented on a 104-hour real-world naturalistic driving dataset annotated with a set of predefined driving behaviors. The dataset poses three challenges. First, the predefined driving behaviors occur sparsely in naturalistic driving. Second, the distribution of driving behaviors is long-tailed. Third, intra-class variation is large. To address these issues, our architecture design leverages recent self-supervised and supervised learning techniques together with the fusion of multimodal cues. Preliminary experiments and discussions were reported.