With the rapid development of the Internet and social media, multi-modal data (text and image) is increasingly important in sentiment analysis tasks. However, the existing methods are difficult to effectively fuse text and image features, which limits the accuracy of analysis. To solve this problem, a multimodal sentiment analysis framework combining BERT and ResNet was proposed. BERT has shown strong text representation ability in natural language processing, and ResNet has excellent image feature extraction performance in the field of computer vision. Firstly, BERT is used to extract the text feature vector, and ResNet is used to extract the image feature representation. Then, a variety of feature fusion strategies are explored, and finally the fusion model based on attention mechanism is selected to make full use of the complementary information between text and image. Experimental results on the public dataset MAVA-single show that compared with the single-modal models that only use BERT or ResNet, the proposed multi-modal model improves the accuracy and F1 score, reaching the best accuracy of 74.5%. This study not only provides new ideas and methods for multimodal sentiment analysis, but also demonstrates the application potential of BERT and ResNet in cross-domain fusion. In the future, more advanced feature fusion techniques and optimization strategies will be explored to further improve the accuracy and generalization ability of multimodal sentiment analysis.