Human affective behavior analysis studies facial expressions and other behaviors to improve the understanding of human psychology. The CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW) makes great efforts to provide diverse data for recognizing the commonly used emotion representations, including Action Units~(AU), basic expression categories, and Valence-Arousal~(VA). In this paper, we present our submission to the CVPR 2023: ABAW5 for AU detection, expression classification, VA estimation, and emotional reaction intensity (ERI) estimation. First, we obtain visual information from a Masked Autoencoder (MAE) model that has been pre-trained on a large-scale face image dataset in a self-supervised manner. The MAE encoder is then fine-tuned for the ABAW challenge tasks on single frames of the Aff-Wild2 dataset. We also exploit the multi-modal and temporal information in the videos and design a transformer-based framework to fuse the multi-modal features. Moreover, we construct a novel two-branch collaboration training strategy that further enhances model generalization by random interpolation in the logit space. Extensive quantitative experiments and ablation studies on the Aff-Wild2 and Hume-Reaction datasets demonstrate the effectiveness of the proposed method.