While deep learning has revolutionized research and applications in NLP and computer vision, this has not yet been the case for behavioral modeling and behavioral health applications. This is because the domain's datasets are smaller, have heterogeneous datatypes, and typically exhibit a large degree of missingness. Therefore, off-the-shelf deep learning models require significant, often prohibitive, adaptation. Accordingly, many research applications still rely on manually coded features with boosted tree models, sometimes with task-specific features handcrafted by experts. Here, we address these challenges by providing a neural architecture framework for mobile sensing data that can learn generalizable feature representations from time series and demonstrates the feasibility of transfer learning on small data domains through finetuning. This architecture combines benefits from CNN and Trans-former architectures to (1) enable better prediction performance by learning directly from raw minute-level sensor data without the need for handcrafted features by up to 0.33 ROC AUC, and (2) use pretraining to outperform simpler neural models and boosted decision trees with data from as few a dozen participants.