Abstract:While tangible user interface has shown its power in naturally interacting with rigid or soft objects, users cannot conveniently use different types of granular materials as the interaction media. We introduce DipMe as a smart device to recognize the types of granular media in real time, which can be used to connect the granular materials in the physical world with various virtual content. Other than vision-based solutions, we propose a dip operation of our device and exploit the haptic signals to recognize different types of granular materials. With modern machine learning tools, we find the haptic signals from different granular media are distinguishable by DipMe. With the online granular object recognition, we build several tangible interactive applications, demonstrating the effects of DipMe in perceiving granular materials and its potential in developing a tangible user interface with granular objects as the new media.
Abstract:We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: \textit{controlled music generation} and \textit{post-production editing}. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music .
Abstract:Knowledge distillation based on student-teacher network is one of the mainstream solution paradigms for the challenging unsupervised Anomaly Detection task, utilizing the difference in representation capabilities of the teacher and student networks to implement anomaly localization. However, over-generalization of the student network to the teacher network may lead to negligible differences in representation capabilities of anomaly, thus affecting the detection effectiveness. Existing methods address the possible over-generalization by using differentiated students and teachers from the structural perspective or explicitly expanding distilled information from the content perspective, which inevitably result in an increased likelihood of underfitting of the student network and poor anomaly detection capabilities in anomaly center or edge. In this paper, we propose Dual-Modeling Decouple Distillation (DMDD) for the unsupervised anomaly detection. In DMDD, a Decouple Student-Teacher Network is proposed to decouple the initial student features into normality and abnormality features. We further introduce Dual-Modeling Distillation based on normal-anomaly image pairs, fitting normality features of anomalous image and the teacher features of the corresponding normal image, widening the distance between abnormality features and the teacher features in anomalous regions. Synthesizing these two distillation ideas, we achieve anomaly detection which focuses on both edge and center of anomaly. Finally, a Multi-perception Segmentation Network is proposed to achieve focused anomaly map fusion based on multiple attention. Experimental results on MVTec AD show that DMDD surpasses SOTA localization performance of previous knowledge distillation-based methods, reaching 98.85% on pixel-level AUC and 96.13% on PRO.
Abstract:The e-commerce platform has evolved rapidly due to its widespread popularity and convenience. Developing an e-commerce shopping assistant for customers is crucial to aiding them in quickly finding desired products and recommending precisely what they need. However, most previous shopping assistants face two main problems: (1) task-specificity, which necessitates the development of different models for various tasks, thereby increasing development costs and limiting effectiveness; and (2) poor generalization, where the trained model performs inadequately on up-to-date products. To resolve these issues, we employ Large Language Models (LLMs) to construct an omnipotent assistant, leveraging their adeptness at handling multiple tasks and their superior generalization capability. Nonetheless, LLMs lack inherent knowledge of e-commerce concepts. To address this, we create an instruction dataset comprising 65,000 samples and diverse tasks, termed as EshopInstruct. Through instruction tuning on our dataset, the assistant, named LLaSA, demonstrates the potential to function as an omnipotent assistant. Additionally, we propose various inference optimization strategies to enhance performance with limited inference resources. In the Amazon KDD Cup 2024 Challenge, our proposed method, LLaSA, achieved an overall ranking of 3rd place on ShopBench, including 57 tasks and approximately 20,000 questions, and we secured top-5 rankings in each track, especially in track4, where we achieved the best performance result among all student teams. Our extensive practices fully demonstrate that LLMs possess the great potential to be competent e-commerce shopping assistants.
Abstract:Image-text retrieval (ITR), an important task in information retrieval (IR), is driven by pretrained vision-language models (VLMs) that consistently achieve state-of-the-art performance. However, a significant challenge lies in the brittleness of existing ITR benchmarks. In standard datasets for the task, captions often provide broad summaries of scenes, neglecting detailed information about specific concepts. Additionally, the current evaluation setup assumes simplistic binary matches between images and texts and focuses on intra-modality rather than cross-modal relationships, which can lead to misinterpretations of model performance. Motivated by this gap, in this study, we focus on examining the brittleness of the ITR evaluation pipeline with a focus on concept granularity. We start by analyzing two common benchmarks, MS-COCO and Flickr30k, and compare them with their augmented versions, MS-COCO-FG and Flickr30k-FG, given a specified set of linguistic features capturing concept granularity. We discover that Flickr30k-FG and MS COCO-FG consistently achieve higher scores across all the selected features. To investigate the performance of VLMs on coarse and fine-grained datasets, we introduce a taxonomy of perturbations. We apply these perturbations to the selected datasets. We evaluate four state-of-the-art models - ALIGN, AltCLIP, CLIP, and GroupViT - on the standard and fine-grained datasets under zero-shot conditions, with and without the applied perturbations. The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. Moreover, the relative performance drop across all setups is consistent across all models and datasets, indicating that the issue lies within the benchmarks. We conclude the paper by providing an agenda for improving ITR evaluation pipelines.
Abstract:Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.
Abstract:Pretrained large models (PLMs), such as ChatGPT, have demonstrated remarkable performance across diverse tasks. However, the significant computational requirements of PLMs have discouraged most product teams from running or fine-tuning them. In such cases, to harness the exceptional performance of PLMs, one must rely on expensive APIs, thereby exacerbating the economic burden. Despite the overall inferior performance of small models, in specific distributions, they can achieve comparable or even superior results. Consequently, some input can be processed exclusively by small models. On the other hand, certain tasks can be broken down into multiple subtasks, some of which can be completed without powerful capabilities. Under these circumstances, small models can handle the simple subtasks, allowing large models to focus on challenging subtasks, thus improving the performance. We propose Data Shunt$^+$ (DS$^+$), a general paradigm for collaboration of small and large models. DS$^+$ not only substantially reduces the cost associated with querying large models but also effectively improves large models' performance. For instance, ChatGPT achieves an accuracy of $94.43\%$ on Amazon Product sentiment analysis, and DS$^+$ achieves an accuracy of $95.64\%$, while the cost has been reduced to only $31.18\%$. Besides, experiments also prove that the proposed collaborative-based paradigm can better inject specific task knowledge into PLMs compared to fine-tuning.
Abstract:To completely understand a document, the use of textual information is not enough. Understanding visual cues, such as layouts and charts, is also required. While the current state-of-the-art approaches for document understanding (both OCR-based and OCR-free) work well, a thorough analysis of their capabilities and limitations has not yet been performed. Therefore, in this work, we addresses the limitation of current VisualQA models when applied to charts and plots. To investigate shortcomings of the state-of-the-art models, we conduct a comprehensive behavioral analysis, using ChartQA as a case study. Our findings indicate that existing models particularly underperform in answering questions related to the chart's structural and visual context, as well as numerical information. To address these issues, we propose three simple pre-training tasks that enforce the existing model in terms of both structural-visual knowledge, as well as its understanding of numerical questions. We evaluate our pre-trained model (called MatCha-v2) on three chart datasets - both extractive and abstractive question datasets - and observe that it achieves an average improvement of 1.7% over the baseline model.
Abstract:Point cloud registration is a prerequisite for many applications in computer vision and robotics. Most existing methods focus on pairwise registration of two point clouds with high overlap. Although there have been some methods for low overlap cases, they struggle in degraded scenarios. This paper introduces a novel framework named L-PR, designed to register unordered low overlap multiview point clouds leveraging LiDAR fiducial markers. We refer to them as LiDAR fiducial markers, but they are the same as the popular AprilTag and ArUco markers, thin sheets of paper that do not affect the 3D geometry of the environment. We first propose an improved adaptive threshold marker detection method to provide robust detection results when the viewpoints among point clouds change dramatically. Then, we formulate the unordered multiview point cloud registration problem as a maximum a-posteriori (MAP) problem and develop a framework consisting of two levels of graphs to address it. The first-level graph, constructed as a weighted graph, is designed to efficiently and optimally infer initial values of scan poses from the unordered set. The second-level graph is constructed as a factor graph. By globally optimizing the variables on the graph, including scan poses, marker poses, and marker corner positions, we tackle the MAP problem. We conduct qualitative and quantitative experiments to demonstrate that the proposed method exhibits superiority over competitors in four aspects: registration accuracy, instance reconstruction quality, localization accuracy, and robustness to the degraded scene. To benefit the community, we open-source our method and dataset at https://github.com/yorklyb/LiDAR-SFM.
Abstract:We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at \url{https://bytedancespeech.github.io/seedtts_tech_report}.