Abstract:Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected. Code and data will be shared at https://videodpo.github.io/.
Abstract:Coral reefs formulate the most valuable and productive marine ecosystems, providing habitat for many marine species. Coral reef surveying and analysis are currently confined to coral experts who invest substantial effort in generating comprehensive and dependable reports (\emph{e.g.}, coral coverage, population, spatial distribution, \textit{etc}), from the collected survey data. However, performing dense coral analysis based on manual efforts is significantly time-consuming, the existing coral analysis algorithms compromise and opt for performing down-sampling and only conducting sparse point-based coral analysis within selected frames. However, such down-sampling will \textbf{inevitable} introduce the estimation bias or even lead to wrong results. To address this issue, we propose to perform \textbf{dense coral video segmentation}, with no down-sampling involved. Through video object segmentation, we could generate more \textit{reliable} and \textit{in-depth} coral analysis than the existing coral reef analysis algorithms. To boost such dense coral analysis, we propose a large-scale coral video segmentation dataset: \textbf{CoralVOS} as demonstrated in Fig. 1. To the best of our knowledge, our CoralVOS is the first dataset and benchmark supporting dense coral video segmentation. We perform experiments on our CoralVOS dataset, including 6 recent state-of-the-art video object segmentation (VOS) algorithms. We fine-tuned these VOS algorithms on our CoralVOS dataset and achieved observable performance improvement. The results show that there is still great potential for further promoting the segmentation accuracy. The dataset and trained models will be released with the acceptance of this work to foster the coral reef research community.
Abstract:Marine object detection has gained prominence in marine research, driven by the pressing need to unravel oceanic mysteries and enhance our understanding of invaluable marine ecosystems. There is a profound requirement to efficiently and accurately identify and localize diverse and unseen marine entities within underwater imagery. The open-marine object detection (OMOD for short) is required to detect diverse and unseen marine objects, performing categorization and localization simultaneously. To achieve OMOD, we present \textbf{MarineDet}. We formulate a joint visual-text semantic space through pre-training and then perform marine-specific training to achieve in-air-to-marine knowledge transfer. Considering there is no specific dataset designed for OMOD, we construct a \textbf{MarineDet dataset} consisting of 821 marine-relative object categories to promote and measure OMOD performance. The experimental results demonstrate the superior performance of MarineDet over existing generalist and specialist object detection algorithms. To the best of our knowledge, we are the first to present OMOD, which holds a more valuable and practical setting for marine ecosystem monitoring and management. Our research not only pushes the boundaries of marine understanding but also offers a standard pipeline for OMOD.
Abstract:The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
Abstract:Underwater visual perception is essentially important for underwater exploration, archeology, ecosystem and so on. The low illumination, light reflections, scattering, absorption and suspended particles inevitably lead to the critically degraded underwater image quality, which causes great challenges on recognizing the objects from the underwater images. The existing underwater enhancement methods that aim to promote the underwater visibility, heavily suffer from the poor image restoration performance and generalization ability. To reduce the difficulty of underwater image enhancement, we introduce the media transmission map as guidance to assist in image enhancement. We formulate the interaction between the underwater visual images and the transmission map to obtain better enhancement results. Even with simple and lightweight network configuration, the proposed method can achieve advanced results of 22.6 dB on the challenging Test-R90 with an impressive 30 times faster than the existing models. Comprehensive experimental results have demonstrated the superiority and potential on underwater perception. Paper's code is offered on: https://github.com/GroupG-yk/MTUR-Net.