Abstract:Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications. Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders, combined with a CTC or attention-based recognition decoder. However, these approaches face several drawbacks: 1) single encoders typically focus on either local trajectories or visual regions, lacking the ability to dynamically capture relevant global features in challenging cases; 2) multi-stream encoders, while more comprehensive, suffer from complex structures and increased inference costs. To tackle this, we propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process. Col-OLHTR consists of a trajectory encoder, a Point-to-Spatial Alignment (P2SA) module, and an attention-based decoder. The P2SA module is designed to learn image-level spatial features through trajectory-encoded features and 2D rotary position embeddings. During training, an additional image-stream encoder-decoder is collaboratively trained to provide supervision for P2SA features. At inference, the extra streams are discarded, and only the P2SA module is used and merged before the decoder, simplifying the process while preserving high performance. Extensive experimental results on several OLHTR benchmarks demonstrate the state-of-the-art (SOTA) performance, proving the effectiveness and robustness of our design.
Abstract:Previous work on long-form audio generation using global-view diffusion or iterative generation demands significant training or inference costs. While recent advancements in multi-view joint diffusion for panoramic generation provide an efficient option, they struggle with spectrum generation with severe overlap distortions and high cross-view consistency costs. We initially explore this phenomenon through the connectivity inheritance of latent maps and uncover that averaging operations excessively smooth the high-frequency components of the latent map. To address these issues, we propose Swap Forward (SaFa), a frame-level latent swap framework that synchronizes multiple diffusions to produce a globally coherent long audio with more spectrum details in a forward-only manner. At its core, the bidirectional Self-Loop Latent Swap is applied between adjacent views, leveraging stepwise diffusion trajectory to adaptively enhance high-frequency components without disrupting low-frequency components. Furthermore, to ensure cross-view consistency, the unidirectional Reference-Guided Latent Swap is applied between the reference and the non-overlap regions of each subview during the early stages, providing centralized trajectory guidance. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based long audio generation models. Moreover, we find that it also adapts well to panoramic generation, achieving comparable state-of-the-art performance with greater efficiency and model generalizability. Project page is available at https://swapforward.github.io/.
Abstract:An accurate evolution model is crucial for effective control and in-depth study of fusion plasmas. Evolution methods based on physical models often encounter challenges such as insufficient robustness or excessive computational costs. Given the proven strong fitting capabilities of deep learning methods across various fields, including plasma research, this paper introduces a deep learning-based magnetic measurement evolution method named PaMMA-Net (Plasma Magnetic Measurements Incremental Accumulative Prediction Network). This network is capable of evolving magnetic measurements in tokamak discharge experiments over extended periods or, in conjunction with equilibrium reconstruction algorithms, evolving macroscopic parameters such as plasma shape. Leveraging a incremental prediction approach and data augmentation techniques tailored for magnetic measurements, PaMMA-Net achieves superior evolution results compared to existing studies. The tests conducted on real experimental data from EAST validate the high generalization capability of the proposed method.
Abstract:Automatic font generation remains a challenging research issue, primarily due to the vast number of Chinese characters, each with unique and intricate structures. Our investigation of previous studies reveals inherent bias capable of causing structural changes in characters. Specifically, when generating a Chinese character similar to, but different from, those in the training samples, the bias is prone to either correcting or ignoring these subtle variations. To address this concern, we propose a novel Skeleton and Font Generation Network (SFGN) to achieve a more robust Chinese character font generation. Our approach includes a skeleton builder and font generator. The skeleton builder synthesizes content features using low-resource text input, enabling our technique to realize font generation independently of content image inputs. Unlike previous font generation methods that treat font style as a global embedding, we introduce a font generator to align content and style features on the radical level, which is a brand-new perspective for font generation. Except for common characters, we also conduct experiments on misspelled characters, a substantial portion of which slightly differs from the common ones. Our approach visually demonstrates the efficacy of generated images and outperforms current state-of-the-art font generation methods. Moreover, we believe that misspelled character generation have significant pedagogical implications and verify such supposition through experiments. We used generated misspelled characters as data augmentation in Chinese character error correction tasks, simulating the scenario where students learn handwritten Chinese characters with the help of misspelled characters. The significantly improved performance of error correction tasks demonstrates the effectiveness of our proposed approach and the value of misspelled character generation.
Abstract:Compressing integer keys is a fundamental operation among multiple communities, such as database management (DB), information retrieval (IR), and high-performance computing (HPC). Recent advances in \emph{learned indexes} have inspired the development of \emph{learned compressors}, which leverage simple yet compact machine learning (ML) models to compress large-scale sorted keys. The core idea behind learned compressors is to \emph{losslessly} encode sorted keys by approximating them with \emph{error-bounded} ML models (e.g., piecewise linear functions) and using a \emph{residual array} to guarantee accurate key reconstruction. While the concept of learned compressors remains in its early stages of exploration, our benchmark results demonstrate that an SIMD-optimized learned compressor can significantly outperform state-of-the-art CPU-based compressors. Drawing on our preliminary experiments, this vision paper explores the potential of learned data compression to enhance critical areas in DBMS and related domains. Furthermore, we outline the key technical challenges that existing systems must address when integrating this emerging methodology.
Abstract:The primary objective of Optical Chemical Structure Recognition is to identify chemical structure images into corresponding markup sequences. However, the complex two-dimensional structures of molecules, particularly those with rings and multiple branches, present significant challenges for current end-to-end methods to learn one-dimensional markup directly. To overcome this limitation, we propose a novel Ring-Free Language (RFL), which utilizes a divide-and-conquer strategy to describe chemical structures in a hierarchical form. RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness while enhancing readability. This approach significantly reduces the learning difficulty for recognition models. Leveraging RFL, we propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings, along with a branch classification module for predicting branch information. Experimental results demonstrate that the proposed RFL and MSD can be applied to various mainstream methods, achieving superior performance compared to state-of-the-art approaches in both printed and handwritten scenarios. The code is available at https://github.com/JingMog/RFL-MSD.
Abstract:As the demand for underwater communication continues to grow, underwater acoustic RIS (UARIS), as an emerging paradigm in underwater acoustic communication (UAC), can significantly improve the communication rate of underwater acoustic systems. However, in open underwater environments, the location of the source node is highly susceptible to being obtained by eavesdropping nodes through correlation analysis, leading to the issue of location privacy in underwater communication systems, which has been overlooked by many studies. To enhance underwater communication and protect location privacy, we propose a novel UARIS architecture integrated with an artificial noise (AN) module. This architecture not only improves communication quality but also introduces noise to interfere with the eavesdroppers' attempts to locate the source node. We derive the Cram\'er-Rao Lower Bound (CRLB) for the localization method deployed by the eavesdroppers and, based on this, model the UARIS-assisted communication scenario as a multi-objective optimization problem. This problem optimizes transmission beamforming, reflective precoding, and noise factors to maximize communication performance and location privacy protection. To efficiently solve this non-convex optimization problem, we develop an iterative algorithm based on fractional programming. Simulation results validate that the proposed system significantly enhances data transmission rates while effectively maintaining the location privacy of the source node in UAC systems.
Abstract:Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.
Abstract:Sound event localization and detection with source distance estimation (3D SELD) involves not only identifying the sound category and its direction-of-arrival (DOA) but also predicting the source's distance, aiming to provide full information about the sound position. This paper proposes a multi-stage video attention network (MVANet) for audio-visual (AV) 3D SELD. Multi-stage audio features are used to adaptively capture the spatial information of sound sources in videos. We propose a novel output representation that combines the DOA with distance of sound sources by calculating the real Cartesian coordinates to address the newly introduced source distance estimation (SDE) task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. We also employ a variety of effective data augmentation and pre-training methods. Experimental results on the STARSS23 dataset have proven the effectiveness of our proposed MVANet. By integrating the aforementioned techniques, our system outperforms the top-ranked method we used in the AV 3D SELD task of the DCASE 2024 Challenge without model ensemble. The code will be made publicly available in the future.
Abstract:We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end speech recognition, combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries during decoding. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge. We also perform the evaluation on the open LibriCSS dataset, achieving a new state-of-the-art performance on single-channel speech recognition.