Abstract:The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, issues such as maintaining visual consistency, ensuring stylistic coherence, and addressing ethical considerations continue to pose challenges. Furthermore, this paper discusses future directions and explores potential advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation
Abstract:Generalization is the core objective when training optimizers from data. However, limited training instances often constrain the generalization capability of the trained optimizers. Co-evolutionary approaches address this challenge by simultaneously evolving a parallel algorithm portfolio (PAP) and an instance population to eventually obtain PAPs with good generalization. Yet, when applied to a specific problem class, these approaches have a major limitation. They require practitioners to provide instance generators specially tailored to the problem class, which is often non-trivial to design. This work proposes a general-purpose, off-the-shelf PAP construction approach, named domain-agnostic co-evolution of parameterized search (DACE), for binary optimization problems where decision variables take values of 0 or 1. The key innovation of DACE lies in its neural network-based domain-agnostic instance representation and generation mechanism that eliminates the need for domain-specific instance generators. The strong generality of DACE is validated across three real-world binary optimization problems: the complementary influence maximization problem (CIMP), the compiler arguments optimization problem (CAOP), and the contamination control problem (CCP). Given only a small set of training instances from these classes, DACE, without requiring any domain knowledge, constructs PAPs with better generalization performance than existing approaches on all three classes, despite their use of domain-specific instance generators.
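As an illustration of how a parallel algorithm portfolio is usually scored, the sketch below assumes that all members run on the same instance and the best result is kept. The function names, the `perf` table, and the choice of maximization are hypothetical and are not taken from DACE itself.

```python
def pap_score(perf, portfolio, instance):
    """Best (highest) objective value any portfolio member reaches on one instance.

    perf[member][instance] is an assumed lookup table of recorded results.
    """
    return max(perf[member][instance] for member in portfolio)


def mean_pap_score(perf, portfolio, instances):
    """Average portfolio score over a non-empty list of instances."""
    return sum(pap_score(perf, portfolio, i) for i in instances) / len(instances)


# Toy results for two hypothetical solvers on two hypothetical instances.
perf = {
    "solver_a": {"inst_1": 3, "inst_2": 9},
    "solver_b": {"inst_1": 7, "inst_2": 4},
}
both = mean_pap_score(perf, ["solver_a", "solver_b"], ["inst_1", "inst_2"])
alone = mean_pap_score(perf, ["solver_a"], ["inst_1", "inst_2"])
```

In this toy table the two-member portfolio scores higher on average than either member alone, because each solver covers the instance on which the other is weak.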
Abstract:Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error rates by two specific risk levels. Furthermore, we investigate semantic redundancy in prediction sets within open-ended contexts for the first time, leading to a promising evaluation metric for MLLMs based on average set size. Our comprehensive experiments across four Video Question-Answering (VideoQA) datasets utilizing eight MLLMs show that TRON achieves desired error rates bounded by two user-specified risk levels. Additionally, deduplicated prediction sets maintain adaptiveness while being more efficient and stable for risk assessment under different risk levels.
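For readers unfamiliar with split conformal prediction, the following minimal sketch shows the generic calibration step: a threshold is taken from held-out nonconformity scores and then used to filter candidate responses. It is the textbook procedure, not TRON's two-step method, and the function names and toy scores are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    Uses the finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha: keep everything.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# Toy calibration scores (hypothetical values).
cal = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
tau = conformal_threshold(cal, alpha=0.5)
kept = prediction_set({"a": 0.3, "b": 0.5, "c": 0.9}, tau)
```

A smaller `alpha` raises the threshold and therefore tends to enlarge the prediction set, which is the trade-off that average set size measures.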
Abstract:During social interactions, understanding the intricacies of the context can be vital, particularly for socially anxious individuals. While previous research has found that the presence of a social interaction can be detected from ambient audio, the nuances within social contexts, which influence how anxiety-provoking interactions are, remain largely unexplored. As an alternative to traditional, burdensome methods like self-report, this study presents a novel approach that harnesses ambient audio segments to detect social threat contexts. We focus on two key dimensions: number of interaction partners (dyadic vs. group) and degree of evaluative threat (explicitly evaluative vs. not explicitly evaluative). Building on data from a Zoom-based social interaction study (N=52 college students, of whom the majority N=45 are socially anxious), we employ deep learning methods to achieve strong detection performance. Under sample-wide 5-fold Cross Validation (CV), our model distinguished dyadic from group interactions with 90% accuracy and detected evaluative threat at 83%. Using a leave-one-group-out CV, accuracies were 82% and 77%, respectively. While our data are based on virtual interactions due to pandemic constraints, our method has the potential to extend to diverse real-world settings. This research underscores the potential of passive sensing and AI to differentiate intricate social contexts, and may ultimately advance the ability of context-aware digital interventions to offer personalized mental health support.
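The two evaluation protocols differ in how segments are split. Sample-wide k-fold can place segments from the same group on both sides of a split, whereas leave-one-group-out holds out all of one group's segments at once. The sketch below shows the latter, assuming each audio segment carries a group identifier; the labels are hypothetical, and the abstract does not specify how groups were defined.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical group label for each of five audio segments.
segment_groups = ["g1", "g1", "g2", "g3", "g2"]
splits = list(leave_one_group_out(segment_groups))
```

Each split tests on one group only, so no group contributes segments to both training and testing within a fold.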
Abstract:This article explores the convergence of connectionist and symbolic artificial intelligence (AI), from historical debates to contemporary advancements. Traditionally considered distinct paradigms, connectionist AI focuses on neural networks, while symbolic AI emphasizes symbolic representation and logic. Recent advancements in large language models (LLMs), exemplified by ChatGPT and GPT-4, highlight the potential of connectionist architectures in handling human language as a form of symbols. The study argues that LLM-empowered Autonomous Agents (LAAs) embody this paradigm convergence. By utilizing LLMs for text-based knowledge modeling and representation, LAAs integrate neuro-symbolic AI principles, showcasing enhanced reasoning and decision-making capabilities. Comparing LAAs with Knowledge Graphs within the neuro-symbolic AI theme highlights the unique strengths of LAAs in mimicking human-like reasoning processes, scaling effectively with large datasets, and leveraging in-context samples without explicit re-training. The research underscores promising avenues in neuro-vector-symbolic integration, instructional encoding, and implicit reasoning, aimed at further enhancing LAA capabilities. By exploring the progression of neuro-symbolic AI and proposing future research trajectories, this work advances the understanding and development of AI technologies.
Abstract:Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model's unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
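One simple way to derive a sampling-based uncertainty score from self-consistency is to measure how often the most frequent sampled answer occurs. The sketch below illustrates that idea only; grouping answers by normalized exact match is an assumption, and the measure proposed in the paper may be defined differently.

```python
from collections import Counter


def self_consistency_uncertainty(samples):
    """Return (most frequent answer, 1 - its relative frequency).

    Answers are grouped by exact match after trimming and lower-casing,
    which is an assumed stand-in for a semantic equivalence check.
    The sample list must be non-empty.
    """
    counts = Counter(s.strip().lower() for s in samples)
    answer, freq = counts.most_common(1)[0]
    return answer, 1 - freq / len(samples)


# Four hypothetical sampled responses to the same question.
answer, uncertainty = self_consistency_uncertainty(["Paris", "paris ", "Lyon", "Paris"])
```

A score near 0 means the samples agree almost unanimously, while a score near 1 means the model's answers are scattered across many alternatives.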
Abstract:As the demand for programming skills grows across industries and academia, students often turn to Programming Online Judge (POJ) platforms for coding practice and competition. The difficulty level of each programming problem serves as an essential reference for guiding students' adaptive learning. However, current methods of determining difficulty levels either require extensive expert annotations or take a long time to accumulate enough student solutions for each problem. To address this issue, we formulate the problem of automatic difficulty level estimation of each programming problem, given its textual description and an example solution in code. To tackle this problem, we propose to couple two pre-trained models, one for text modality and the other for code modality, into a unified model. We built two POJ datasets for the task, and the results demonstrate the effectiveness of the proposed approach and the contributions of both modalities.
Abstract:Real-world applications involve various discrete optimization problems. Designing a specialized optimizer for each of these problems is challenging, typically requiring significant domain knowledge and human effort. Hence, developing general-purpose optimizers as an off-the-shelf tool for a wide range of problems has been a long-standing research target. This article introduces MEGO, a novel general-purpose neural optimizer trained through a fully data-driven learning-to-optimize (L2O) approach. MEGO consists of a mixture-of-experts trained on experiences from solving training problems and can be viewed as a foundation model for optimization problems with binary decision variables. When presented with a problem to solve, MEGO actively selects relevant expert models to generate high-quality solutions. MEGO can be used as a standalone sample-efficient optimizer or in conjunction with existing search methods as an initial solution generator. The generality of MEGO is validated across six problem classes, including three classic problem classes and three problem classes arising from real-world applications in compilers, network analysis, and 3D reconstruction. Trained solely on classic problem classes, MEGO performs very well on all six problem classes, significantly surpassing widely used general-purpose optimizers in both solution quality and efficiency. In some cases, MEGO even surpasses specialized state-of-the-art optimizers. Additionally, MEGO provides a similarity measure between problems, yielding a new perspective for problem classification. In the pursuit of general-purpose optimizers through L2O, MEGO represents an initial yet significant step forward.
Abstract:With the rapid advancements in artificial intelligence, the development of knowledgeable and personalized agents has become increasingly prevalent. However, the inherent variability in state variables and action spaces among personalized agents poses significant aggregation challenges for traditional federated learning algorithms. To tackle these challenges, we introduce the Federated Split Decision Transformer (FSDT), an innovative framework designed explicitly for AI agent decision tasks. The FSDT framework excels at navigating the intricacies of personalized agents by harnessing distributed data for training while preserving data privacy. It employs a two-stage training process, with local embedding and prediction models on client agents and a global transformer decoder model on the server. Our comprehensive evaluation using the benchmark D4RL dataset highlights the superior performance of our algorithm in federated split learning for personalized agents, coupled with significant reductions in communication and computational overhead compared to traditional centralized training approaches. The FSDT framework demonstrates strong potential for enabling efficient and privacy-preserving collaborative learning in applications such as autonomous driving decision systems. Our findings underscore the efficacy of the FSDT framework in effectively leveraging distributed offline reinforcement learning data to enable powerful multi-type agent decision systems.
Abstract:Uncertainty estimation plays a pivotal role in ensuring the reliability of safety-critical human-AI interaction systems, particularly in the medical domain. However, a general method for quantifying the uncertainty of free-form answers has yet to be established in open-ended medical question-answering (QA) tasks, where irrelevant words and sequences with limited semantic information can be the primary source of uncertainty due to the presence of generative inequality. In this paper, we propose the Word-Sequence Entropy (WSE), which calibrates the uncertainty proportion at both the word and sequence levels according to the semantic relevance, with greater emphasis placed on keywords and more relevant sequences when performing uncertainty quantification. We compare WSE with 6 baseline methods on 5 free-form medical QA datasets, utilizing 7 "off-the-shelf" large language models (LLMs), and show that WSE exhibits superior performance on accurate uncertainty measurement under two standard criteria for correctness evaluation (e.g., WSE outperforms the existing state-of-the-art method by 3.23% AUROC on the MedQA dataset). Additionally, in terms of the potential for real-world medical QA applications, we achieve a significant enhancement in the performance of LLMs when employing sequences with lower uncertainty, identified by WSE, as final answers (e.g., +6.36% accuracy improvement on the COVID-QA dataset), without requiring any additional task-specific fine-tuning or architectural modifications.