Abstract: Multimodal Large Language Models (MLLMs) excel at solving text-based mathematical problems, but they struggle with mathematical diagrams because they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, yet MLLMs perform worse as information shifts from the textual to the visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle these challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model's mathematical reasoning capabilities with an extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence between next-token prediction distributions to align the visual and textual modalities, we ensure consistent problem-solving abilities across these forms. Finally, we apply multimodal instruction tuning to the MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities.
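The alignment idea in the abstract above can be illustrated with a minimal sketch. It assumes that the same problem, presented in a text-rich form and in a vision-rich form, yields one vector of next-token logits per answer position, and it averages the per-position KL divergence between the two resulting distributions. The function names, the direction of the divergence (text-rich as reference), and the averaging are assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(text_logits, visual_logits):
    """Mean per-position KL between next-token distributions.

    Assumption: text_logits[i] and visual_logits[i] are the logits for the
    same answer position under the text-rich and vision-rich presentation
    of one problem. The text-rich distribution is used as the reference.
    """
    if len(text_logits) != len(visual_logits):
        raise ValueError("both inputs must cover the same positions")
    per_position = [
        kl_divergence(softmax(t), softmax(v))
        for t, v in zip(text_logits, visual_logits)
    ]
    return sum(per_position) / len(per_position)
```

A loss of zero means the two presentations induce identical next-token distributions; larger values indicate that the model answers differently depending on the modality.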
Abstract: Versatile Video Coding (VVC) has significantly increased encoding efficiency at the expense of numerous complex coding tools, particularly the flexible Quad-Tree plus Multi-type Tree (QTMT) block partition. This paper proposes a deep learning-based algorithm for fast QTMT partitioning in VVC intra coding. Our solution greatly reduces encoding time by early termination of less likely intra predictions and partitions, with a negligible BD-BR increase. First, a redesigned U-Net is adopted as the network's fundamental framework. Next, we design a Quantization Parameter (QP) fusion network to regulate the effect of QPs on the partition results. Finally, we adopt a refined post-processing strategy to better balance encoding performance and complexity. Experimental results demonstrate that our solution outperforms state-of-the-art methods, with a complexity reduction of 44.74% to 68.76% and a BD-BR increase of 0.60% to 2.33%.
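As a rough illustration of early termination, the sketch below assumes that a network outputs one probability per QTMT split mode for a block and that the encoder skips any mode whose probability falls below a threshold. The threshold rule, the mode labels, and the function name are hypothetical; the abstract does not describe the actual post-processing strategy.

```python
# The six split decisions available to a block under QTMT partitioning.
SPLIT_MODES = ("NO_SPLIT", "QT", "BT_H", "BT_V", "TT_H", "TT_V")


def modes_to_test(probs, threshold=0.1):
    """Return the split modes the encoder would still evaluate.

    Assumption: probs[i] is a predicted probability for SPLIT_MODES[i].
    Modes below the threshold are terminated early; the most likely mode
    is always kept so that at least one candidate remains.
    """
    if len(probs) != len(SPLIT_MODES):
        raise ValueError("expected one probability per split mode")
    best = max(range(len(probs)), key=probs.__getitem__)
    return [
        mode
        for i, mode in enumerate(SPLIT_MODES)
        if i == best or probs[i] >= threshold
    ]
```

In a scheme like this, raising the threshold skips more rate-distortion checks and saves more encoding time, at the cost of a larger BD-BR increase; lowering it does the opposite.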
Abstract: Real-time motion prediction of a floating offshore platform refers to forecasting its motions over the following one or two wave cycles, which helps improve the performance of a motion compensation system and provides useful early-warning information. In this study, we extend a deep learning (DL) model, which can predict the heave and surge motions of a floating semi-submersible 20 to 50 seconds ahead with good accuracy, to quantify the uncertainty of its predicted time series using the dropout technique. By repeating the inference several times, we find that the collection of predicted time series forms a Gaussian process (GP). The DL model with dropout learns a kernel internally, and the learning procedure is similar to GP regression. Adding noise to the training data helps the model learn more robust features, leading to better performance on test data across a wide range of noise levels. This study extends the understanding of DL models for predicting the wave-excited motions of an offshore platform.
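The dropout-based uncertainty estimate described above can be sketched as follows. This is a toy stand-in, not the study's model: a single linear layer with fixed weights replaces the DL network, dropout stays active at inference time, and the spread of repeated predictions serves as the uncertainty. All names and default values are assumptions.

```python
import random
import statistics


def dropout_forward(features, weights, drop_prob, rng):
    """One stochastic forward pass of a linear layer with dropout."""
    keep_prob = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep_prob:
            # Inverted dropout: rescale kept inputs to preserve the mean.
            total += x * w / keep_prob
    return total


def mc_dropout_predict(features, weights, drop_prob=0.2, n_passes=200, seed=0):
    """Repeat inference with dropout on; return (mean, standard deviation)."""
    rng = random.Random(seed)
    samples = [
        dropout_forward(features, weights, drop_prob, rng)
        for _ in range(n_passes)
    ]
    return statistics.mean(samples), statistics.stdev(samples)
```

With `drop_prob=0.0` every pass is identical and the standard deviation collapses to zero, which is a quick sanity check that the reported spread comes from dropout alone.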
Abstract: Real-time motion prediction of a vessel or a floating platform can help to improve the performance of motion compensation systems. It can also provide useful early-warning information for motion-critical offshore operations. In this study, a long short-term memory (LSTM)-based machine learning model was developed to predict the heave and surge motions of a semi-submersible. The training and test data came from a model test carried out in the deep-water ocean basin at Shanghai Jiao Tong University, China. The motion and measured waves were fed into LSTM cells and then passed through several fully connected (FC) layers to obtain the prediction. With the help of measured waves, the prediction extended 46.5 s into the future with an average accuracy close to 90%. Using a noise-extended dataset, the trained model worked effectively at noise levels up to 0.8. As a further step, the model could predict motions based only on the motion itself. Based on sensitivity studies of the model architecture, guidelines for constructing the machine learning model are proposed. The proposed LSTM model shows a strong ability to predict wave-excited vessel motions.
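One way to build a noise-extended dataset like the one mentioned above is sketched here. It assumes that "noise level" means the ratio of the added Gaussian noise's standard deviation to the standard deviation of the clean signal; the abstract does not define the term, so this definition and the function names are assumptions.

```python
import random
import statistics


def add_noise(signal, noise_level, rng):
    """Return a copy of signal with zero-mean Gaussian noise added.

    Assumption: noise_level is the noise standard deviation expressed as
    a fraction of the clean signal's standard deviation.
    """
    sigma = noise_level * statistics.pstdev(signal)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def noise_extended_dataset(signal, noise_levels, seed=0):
    """Return the clean series followed by one noisy copy per noise level."""
    rng = random.Random(seed)
    dataset = [list(signal)]
    for level in noise_levels:
        dataset.append(add_noise(signal, level, rng))
    return dataset
```

For example, `noise_extended_dataset(heave, [0.2, 0.4, 0.8])` would yield four series of equal length for a hypothetical measured series `heave`, which could then be windowed into training samples.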