Abstract:As one of the main solutions to the information overload problem, recommender systems are widely used in daily life. In the recent emerging micro-video recommendation scenario, micro-videos contain rich multimedia information, involving text, image, video and other multimodal data, and these rich multimodal information conceals users' deep interest in the items. Most of the current recommendation algorithms based on multimodal data use multimodal information to expand the information on the item side, but ignore the different preferences of users for different modal information, and lack the fine-grained mining of the internal connection of multimodal information. To investigate the problems in the micro-video recommendr system mentioned above, we design a hybrid recommendation model based on multimodal information, introduces multimodal information and user-side auxiliary information in the network structure, fully explores the deep interest of users, measures the importance of each dimension of user and item feature representation in the scoring prediction task, makes the application of graph neural network in the recommendation system is improved by using an attention mechanism to fuse the multi-layer state output information, allowing the shallow structural features provided by the intermediate layer to better participate in the prediction task. The recommendation accuracy is improved compared with the traditional recommendation algorithm on different data sets, and the feasibility and effectiveness of our model is verified.