Abstract:This paper introduces the design and implementation of WiField, a WiFi sensing system deployed on COTS devices that can simultaneously identify multiple wavelength-level targets placed flexibly. Unlike traditional RF sensing schemes that focus on specific targets and RF links, WiField focuses on all media in the sensing area for the entire electric field. In this perspective, WiField provides a unified framework to finely characterize the diffraction, scattering, and other effects of targets at different positions, materials, and numbers on signals. The combination of targets in different positions, numbers, and sizes is just a special case. WiField proposed a scheme that utilizes phaseless data to complete the inverse mapping from electric field to material distribution, thereby achieving the simultaneous identification of multiple wavelength-level targets at any position and having the potential for deployment on a wide range of low-cost COTS devices. Our evaluation results show that it has an average identification accuracy of over 97% for 1-3 targets (5 cm * 10 cm in size) with different materials randomly placed within a 1.05 m * 1.05 m area.
Abstract:As intellectual property rights, the copyright protection of deep models is becoming increasingly important. Existing work has made many attempts at model watermarking and fingerprinting, but they have ignored homologous models trained with similar structures or training datasets. We highlight challenges in efficiently querying black-box piracy models to protect model copyrights without misidentifying homologous models. To address these challenges, we propose a novel method called DeepCore, which discovers that the classification confidence of the model is positively correlated with the distance of the predicted sample from the model decision boundary and piracy models behave more similarly at high-confidence classified sample points. Then DeepCore constructs core points far away from the decision boundary by optimizing the predicted confidence of a few sample points and leverages behavioral discrepancies between piracy and homologous models to identify piracy models. Finally, we design different model identification methods, including two similarity-based methods and a clustering-based method to identify piracy models using models' predictions of core points. Extensive experiments show the effectiveness of DeepCore in identifying various piracy models, achieving lower missed and false identification rates, and outperforming state-of-the-art methods.
Abstract:To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96\% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can inference at a fast speed even when there are numerous objects in a single engineering drawing.
Abstract:The significance of intelligent sensing systems is growing in the realm of smart services. These systems extract relevant signal features and generate informative representations for particular tasks. However, building the feature extraction component for such systems requires extensive domain-specific expertise or data. The exceptionally rapid development of foundation models is likely to usher in newfound abilities in such intelligent sensing. We propose a new scheme for sensing model, which we refer to as semi-generalist sensing model (SGSM). SGSM is able to semiautomatically solve various tasks using relatively less task-specific labeled data compared to traditional systems. Built through the analysis of the common theoretical model, SGSM can depict different modalities, such as the acoustic and Wi-Fi signal. Experimental results on such two heterogeneous sensors illustrate that SGSM functions across a wide range of scenarios, thereby establishing its broad applicability. In some cases, SGSM even achieves better performance than sensor-specific specialized solutions. Wi-Fi evaluations indicate a 20\% accuracy improvement when applying SGSM to an existing sensing model.
Abstract:Integrated Sensing and Communication (ISAC) is gradually becoming a reality due to the significant increase in frequency and bandwidth of next-generation wireless communication technologies. Therefore it becomes crucial to evaluate the communication and sensing performance using appropriate channel models to address resource competition from each other. Existing work only models the sensing capability based on the mutual information between the channel response and the received signal, and its theoretical resolution is difficult to support the high-precision requirements of ISAC for sensing tasks, and may even affect its communication optimal. In this paper, we propose a sensing channel encoder model to measure the sensing capacity with higher resolution by discrete task mutual information. For the first time, derive upper and lower bounds on the sensing accuracy for a given channel. This model not only provides the possibility of optimizing the ISAC systems at a finer granularity and balancing communication and sensing resources, but also provides theoretical explanations for classical intuitive feelings (like more modalities more accuracy) in wireless sensing. Furthermore, we validate the effectiveness of the proposed channel model through real-case studies, including person identification, displacement detection, direction estimation, and device recognition. The evaluation results indicate a Pearson correlation coefficient exceeding 0.9 between our task mutual information and conventional experimental metrics (e.g., accuracy).
Abstract:While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model the intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
Abstract:Speech conveys more information than just text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompt for speech. In this work, we introduce PromptTTS 2 to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize the large language models (LLM) to compose high quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice) based on the text prompt representation. For the prompt generation pipeline, it generates text prompts for speech with a speech understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompt based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online\footnote{https://speechresearch.github.io/prompttts2}.
Abstract:In this paper, we investigate the streaming bandits problem, wherein the learner aims to minimize regret by dealing with online arriving arms and sublinear arm memory. We establish the tight worst-case regret lower bound of $\Omega \left( (TB)^{\alpha} K^{1-\alpha}\right), \alpha = 2^{B} / (2^{B+1}-1)$ for any algorithm with a time horizon $T$, number of arms $K$, and number of passes $B$. The result reveals a separation between the stochastic bandits problem in the classical centralized setting and the streaming setting with bounded arm memory. Notably, in comparison to the well-known $\Omega(\sqrt{KT})$ lower bound, an additional double logarithmic factor is unavoidable for any streaming bandits algorithm with sublinear memory permitted. Furthermore, we establish the first instance-dependent lower bound of $\Omega \left(T^{1/(B+1)} \sum_{\Delta_x>0} \frac{\mu^*}{\Delta_x}\right)$ for streaming bandits. These lower bounds are derived through a unique reduction from the regret-minimization setting to the sample complexity analysis for a sequence of $\epsilon$-optimal arms identification tasks, which maybe of independent interest. To complement the lower bound, we also provide a multi-pass algorithm that achieves a regret upper bound of $\tilde{O} \left( (TB)^{\alpha} K^{1 - \alpha}\right)$ using constant arm memory.
Abstract:Error correction in automatic speech recognition (ASR) aims to correct those incorrect words in sentences generated by ASR models. Since recent ASR models usually have low word error rate (WER), to avoid affecting originally correct tokens, error correction models should only modify incorrect words, and therefore detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide clear signal about which tokens are incorrect and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct or not through a probability produced by a dedicatedly designed language model, and then design a constrained CTC loss that only duplicates the detected incorrect tokens to let the decoder focus on the correction of error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides explicit signal about which words are incorrect and thus does not need to duplicate every token but only incorrect tokens; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but just leaves it to CTC loss. Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming previous works by a large margin, while still enjoying fast speed of parallel generation.
Abstract:Unintended memorization of various information granularity has garnered academic attention in recent years, e.g. membership inference and property inference. How to inversely use this privacy leakage to facilitate real-world applications is a growing direction; the current efforts include dataset ownership inference and user auditing. Standing on the data lifecycle and ML model production, we propose an inference process named Data Provenance Inference, which is to infer the generation, collection or processing property of the ML training data, to assist ML developers in locating the training data gaps without maintaining strenuous metadata. We formularly define the data provenance and the data provenance inference task in ML training. Then we propose a novel inference strategy combining embedded-space multiple instance classification and shadow learning. Comprehensive evaluations cover language, visual and structured data in black-box and white-box settings, with diverse kinds of data provenance (i.e. business, county, movie, user). Our best inference accuracy achieves 98.96% in the white-box text model when "author" is the data provenance. The experimental results indicate that, in general, the inference performance positively correlated with the amount of reference data for inference, the depth and also the amount of the parameter of the accessed layer. Furthermore, we give a post-hoc statistical analysis of the data provenance definition to explain when our proposed method works well.