Zhaowen Wang

CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research

Mar 24, 2025

Monan Zhou, Shenyang Xu, Zhaorui Liu, Zhaowen Wang, Feng Yu, Wei Li, Baoqiang Han

Abstract:Data are crucial in various computer-related fields, including music information retrieval (MIR), an interdisciplinary area bridging computer science and music. This paper introduces CCMusic, an open and diverse database comprising multiple datasets specifically designed for tasks related to Chinese music, highlighting our focus on this culturally rich domain. The database integrates both published and unpublished datasets, with steps taken such as data cleaning, label refinement, and data structure unification to ensure data consistency and create ready-to-use versions. We conduct benchmark evaluations for all datasets using a unified evaluation framework developed specifically for this purpose. This publicly available framework supports both classification and detection tasks, ensuring standardized and reproducible results across all datasets. The database is hosted on HuggingFace and ModelScope, two open and multifunctional data and model hosting platforms, ensuring ease of accessibility and usability.

* Transactions of the International Society for Music Information Retrieval, 2025, 8(1), 22-38
* 17 pages, 18 figures

Phase Error Sensitivity to Injection Signals in Multi-Phase Injection-Locked Ring Oscillators

Jan 03, 2025

Zhaowen Wang

Abstract:Multi-phase injection-locked ring oscillators (MP-ILROs) are widely used for multi-phase clock generation, with their phase accuracy primarily determined by the inherent accuracy of the oscillator itself, due to the suppression of input signal errors. However, a quantitative analysis of the oscillator's sensitivity to input errors remains largely unexplored. This paper presents a phasor-based analysis of injection locking, revealing that the phase error sensitivity is influenced by factors such as injection strength and the free-running frequency of the oscillator. Simulation results align closely with theoretical calculations, validating the effectiveness of the proposed method.

Amplitude-to-Phase Conversion in Injection-Locked CMOS Ring Oscillators

Dec 28, 2024

Zhaowen Wang

Abstract:Injection-locked ring oscillators (ILROs) are extensively employed for multi-phase clock generation in wireline and optical links. However, existing injection-locking theorems primarily rely on linearized phase-domain or nonlinear time-domain models, which fail to account for amplitude-to-phase conversion effects inherent in ILROs. This paper introduces an enhanced analytical model based on an extension of Adler's equation, explicitly incorporating amplitude-to-phase conversion. Simulation results demonstrate strong alignment with the proposed analytical predictions, validating the model's accuracy in capturing the locking range and phasor relationships.
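For orientation, the classical Adler equation that the abstract says the model extends is usually written as below. This is the textbook form for a weakly injected oscillator, not the paper's extended model, and the symbols are conventional notation assumed here rather than taken from the abstract: $\theta$ is the oscillator phase relative to the injected signal, $\omega_0$ the free-running frequency, $\omega_{\mathrm{inj}}$ the injection frequency, and $\omega_L$ the one-sided lock range. Sign conventions vary between references.

```latex
\frac{d\theta}{dt} = \omega_0 - \omega_{\mathrm{inj}} - \omega_L \sin\theta ,
\qquad
\sin\theta_{\mathrm{ss}} = \frac{\omega_0 - \omega_{\mathrm{inj}}}{\omega_L}
\quad \text{(locked, requires } |\omega_0 - \omega_{\mathrm{inj}}| \le \omega_L \text{)}
```

In this baseline form the steady-state phase depends only on the frequency offset and the lock range; amplitude does not appear, which is the gap the abstract describes.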

WAS: Dataset and Methods for Artistic Text Segmentation

Jul 31, 2024

Xudong Xie, Yuzhe Li, Yang Liu, Zhifei Zhang, Zhaowen Wang, Wei Xiong, Xiang Bai

Abstract:Accurate text segmentation results are crucial for text-related generative tasks, such as text image generation, text editing, text removal, and text style transfer. Recently, some scene text segmentation methods have made significant progress in segmenting regular text. However, these methods perform poorly in scenarios containing artistic text. Therefore, this paper focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset. One challenge of the task is that the local stroke shapes of artistic text are diverse, complex, and highly variable. We propose a decoder with a layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes. Another challenge is the complexity of the global topological structure. We further design a skeleton-assisted head to guide the model to focus on the global structure. Additionally, to enhance the generalization performance of the text segmentation model, we propose a strategy for training data synthesis based on a large multi-modal model and a diffusion model. Experimental results show that our proposed method and synthetic dataset can significantly enhance the performance of artistic text segmentation and achieve state-of-the-art results on other public datasets.

* Accepted by ECCV 2024

Scaling Up Video Summarization Pretraining with Large Language Models

Apr 04, 2024

Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heilbron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, Joon Son Chung

Abstract:Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in their size, constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state-of-the-art in video summarization across several benchmarks.

* Accepted to CVPR 2024

Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation

Nov 30, 2023

Linzi Xing, Quan Tran, Fabian Caba, Franck Dernoncourt, Seunghyun Yoon, Zhaowen Wang, Trung Bui, Giuseppe Carenini

Abstract:Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks like video scene/shot segmentation cater to short videos with clear visual shifts but falter for long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability, in both intra- and cross-domain settings.

* Accepted at the 30th International Conference on Multimedia Modeling (MMM 2024)

DualVector: Unsupervised Vector Font Synthesis with Dual-Part Representation

May 17, 2023

Ying-Tian Liu, Zhifei Zhang, Yuan-Chen Guo, Matthew Fisher, Zhaowen Wang, Song-Hai Zhang

Abstract:Automatic generation of fonts can be an important aid to typeface design. Many current approaches regard glyphs as pixelated images, which present artifacts when scaling and inevitable quality losses after vectorization. On the other hand, existing vector font synthesis methods either fail to represent the shape concisely or require vector supervision during training. To push the quality of vector font synthesis to the next level, we propose a novel dual-part representation for vector glyphs, where each glyph is modeled as a collection of closed "positive" and "negative" path pairs. The glyph contour is then obtained by boolean operations on these paths. We first learn such a representation only from glyph images and devise a subsequent contour refinement step to align the contour with an image representation to further enhance details. Our method, named DualVector, outperforms state-of-the-art methods in vector font synthesis both quantitatively and qualitatively. Our synthesized vector fonts can be easily converted to common digital font formats like TrueType Font for practical use. The code is released at https://github.com/thuliu-yt16/dualvector.

* CVPR 2023
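
The dual-part representation in the DualVector abstract above can be illustrated with a minimal sketch. The abstract only says the contour comes from boolean operations on "positive" and "negative" path pairs; the specific composition used here (union over pairs of positive minus negative) is an assumption, and `disk` and `glyph_contains` are hypothetical helpers that use filled disks as stand-ins for closed paths. It is not the released DualVector code.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
# A closed region is modelled as a membership test: True if the point is inside.
Region = Callable[[Point], bool]


def disk(cx: float, cy: float, r: float) -> Region:
    """Hypothetical stand-in for a closed path: a filled disk."""
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r * r


def glyph_contains(pairs: List[Tuple[Region, Region]], p: Point) -> bool:
    """Assumed composition: a point is in the glyph if, for at least one
    pair, it lies inside the positive region and outside the negative one."""
    return any(pos(p) and not neg(p) for pos, neg in pairs)


# A ring shaped like the letter "O": outer disk minus inner disk.
ring = [(disk(0.0, 0.0, 2.0), disk(0.0, 0.0, 1.0))]

print(glyph_contains(ring, (1.5, 0.0)))  # point on the stroke
print(glyph_contains(ring, (0.0, 0.0)))  # point in the hole
```

Representing regions as membership tests keeps the sketch dependency-free; a real pipeline would operate on vector paths and extract the contour from them.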

Improving Diffusion Models for Scene Text Editing with Dual Encoders

Apr 12, 2023

Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, Shiyu Chang

Abstract:Scene text editing is a challenging task that involves modifying or inserting specified texts in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert texts into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. An instruction tuning framework is introduced to train our model to learn the mapping from the text instruction to the corresponding image with either the specified style or the style of the surrounding texts in the background. This training method also gives our method zero-shot generalization ability in the following three scenarios: generating text with unseen font variation, e.g., italic and bold, mixing different fonts to construct a new font, and using more relaxed forms of natural language as the instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE

* 22 pages, 19 figures

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Mar 13, 2023

Bo He, Jun Wang, Jielin Qiu, Trung Bui, Abhinav Shrivastava, Zhaowen Wang

Abstract:The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address this issue, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, achieving state-of-the-art performance on all datasets. Moreover, we collected a large-scale multimodal summarization dataset BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.

* Accepted at CVPR 2023

LiveSeg: Unsupervised Multimodal Temporal Segmentation of Long Livestream Videos

Oct 12, 2022

Jielin Qiu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Ding Zhao, Hailin Jin

Abstract:Livestream videos have become a significant part of online learning, where design, digital marketing, creative painting, and other skills are taught by experienced experts in the sessions, making them valuable materials. However, Livestream tutorial videos are usually hours long, recorded, and uploaded to the Internet directly after the live sessions, making it hard for other people to catch up quickly. An outline would be a beneficial solution, which requires the video to be temporally segmented according to topics. In this work, we introduce a large Livestream video dataset named MultiLive, and formulate the temporal segmentation of the long Livestream videos (TSLLV) task. We propose LiveSeg, an unsupervised Livestream video temporal Segmentation solution, which takes advantage of multimodal features from different domains. Our method achieves a 16.8% F1-score performance improvement compared with the state-of-the-art method.
