Anh Truong

Low-Rank Adaptation of Neural Fields

Apr 22, 2025

Anh Truong, Ahmed H. Mahmoud, Mina Konaković Luković, Justin Solomon

Abstract:Processing visual data often involves small adjustments or sequences of changes, such as in image filtering, surface smoothing, and video storage. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields (NF) -- neural network parameterizations of visual or physical functions -- has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA to instance-specific neural fields, avoiding the need for large pre-trained models and yielding a pipeline suitable for low-compute hardware. We validate our approach with experiments in image filtering, video compression, and geometry editing, demonstrating its effectiveness and versatility for representing neural field updates.
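As a rough illustration of the low-rank update the abstract refers to, the sketch below applies a generic LoRA-style change W' = W + BA to a single weight matrix and compares parameter counts. It is a minimal sketch in plain Python, not the authors' pipeline: the function names, the layer sizes, and the zero initialization of B are all assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in a full update, parameters in a rank-r update)."""
    return d_out * d_in, rank * (d_out + d_in)

def adapted_weight(W, B, A):
    """W' = W + B A, with W (d_out x d_in), B (d_out x r), A (r x d_in)."""
    return add(W, matmul(B, A))

random.seed(0)
d_out, d_in, rank = 4, 3, 1  # hypothetical layer sizes
W = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
# Assumption: B starts at zero, so the adapted layer initially equals the base layer.
B = [[0.0 for _ in range(rank)] for _ in range(d_out)]
A = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]

W_adapted = adapted_weight(W, B, A)
full, low_rank = lora_param_counts(256, 256, 4)
print(full, low_rank)
```

Only B and A would be stored or trained for an edit in this scheme, which is where the savings come from when the rank is much smaller than the layer width.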

SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

Apr 07, 2025

Stephen Brade, Sam Anderson, Rithesh Kumar, Zeyu Jin, Anh Truong

Abstract:Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many creators struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.

VideoMix: Aggregating How-To Videos for Task-Oriented Learning

Mar 27, 2025

Saelyne Yang, Anh Truong, Juho Kim, Dingzeyu Li

Abstract:Tutorial videos are a valuable resource for people looking to learn new tasks. People often watch multiple tutorial videos to get an overall understanding of a task by comparing different approaches to it. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.

* In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25) 2025

TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks

Mar 12, 2024

Yuexi Chen, Vlad I. Morariu, Anh Truong, Zhicheng Liu

Abstract:Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives than timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to effectively harness their powers, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. We show that TutoAI has achieved higher or similar quality compared to a baseline model in preliminary user studies.

* CHI 2024, supplementary materials: https://hdi.cs.umd.edu/papers/TutoAI_CHI24_Supp.pdf

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions

Dec 17, 2020

Anh Truong, Austin Walters, Jeremy Goodsitt

Abstract:Named Entity Recognition has been extensively investigated in many fields. However, the application of sensitive entity detection for production systems in financial institutions has not been well explored due to the lack of publicly available, labeled datasets. In this paper, we use internal and synthetic datasets to evaluate various methods of detecting NPI (Nonpublic Personally Identifiable) information commonly found within financial institutions, in both unstructured and structured data formats. Character-level neural network models including CNN, LSTM, BiLSTM-CRF, and CNN-CRF are investigated on two prediction tasks: (i) entity detection on multiple data formats, and (ii) column-wise entity prediction on tabular datasets. We compare these models with other standard approaches on both real and synthetic data, with respect to F1-score, precision, recall, and throughput. The real datasets include internal structured data and public email data with manually tagged labels. Our experimental results show that the CNN model is simple yet effective with respect to accuracy and throughput and is thus the most suitable candidate model to be deployed in the production environment(s). Finally, we provide several lessons learned on data limitations, data labelling, and the intrinsic overlap of data entities.

Rekall: Specifying Video Events using Compositions of Spatiotemporal Labels

Oct 07, 2019

Daniel Y. Fu, Will Crichton, James Hong, Xinwei Yao, Haotian Zhang, Anh Truong, Avanika Narayan, Maneesh Agrawala, Christopher Ré, Kayvon Fatahalian

Abstract:Many real-world video analysis applications require the ability to identify domain-specific events in video, such as interviews and commercials in TV news broadcasts, or action sequences in film. Unfortunately, pre-trained models to detect all the events of interest in video may not exist, and training new models from scratch can be costly and labor-intensive. In this paper, we explore the utility of specifying new events in video in a more traditional manner: by writing queries that compose outputs of existing, pre-trained models. To write these queries, we have developed Rekall, a library that exposes a data model and programming model for compositional video event specification. Rekall represents video annotations from different sources (object detectors, transcripts, etc.) as spatiotemporal labels associated with continuous volumes of spacetime in a video, and provides operators for composing labels into queries that model new video events. We demonstrate the use of Rekall in analyzing video from cable TV news broadcasts, films, static-camera vehicular video streams, and commercial autonomous vehicle logs. In these efforts, domain experts were able to quickly (in a few hours to a day) author queries that enabled the accurate detection of new events (on par with, and in some cases much more accurate than, learned approaches) and to rapidly retrieve video clips for human-in-the-loop tasks such as video content curation and training data curation. Finally, in a user study, novice users of Rekall were able to author queries to retrieve new events in video given just one hour of query development time.
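To make the idea of composing labels into event queries more concrete, here is a small sketch that treats labels as plain time intervals and builds an "interview" event from the overlap of two hypothetical face-detection tracks. It does not use Rekall's actual API and ignores the spatial dimension; the helper names `overlaps` and `coalesce`, the `max_gap` parameter, and the sample data are assumptions for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)

def overlaps(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the time spans where a label from `a` and a label from `b` co-occur."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:  # keep only non-empty intersections
                out.append((s, e))
    return sorted(out)

def coalesce(intervals: List[Interval], max_gap: float = 0.0) -> List[Interval]:
    """Merge intervals separated by at most `max_gap` seconds."""
    merged: List[Interval] = []
    for s, e in sorted(intervals):
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Hypothetical outputs of a pre-trained face identifier, in seconds.
host_on_screen = [(0.0, 10.0), (20.0, 30.0)]
guest_on_screen = [(5.0, 12.0), (22.0, 25.0), (26.0, 40.0)]

# Query: an "interview" is a span where both faces appear, tolerating short cuts.
both = overlaps(host_on_screen, guest_on_screen)
interview = coalesce(both, max_gap=2.0)
print(interview)
```

The query itself is two lines; the rest is reusable operators, which is the kind of composition the abstract describes.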

Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools

Sep 03, 2019

Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, C. Bayan Bruss, Reza Farivar

Abstract:There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.

Submodular Trajectory Optimization for Aerial 3D Scanning

Aug 04, 2017

Mike Roberts, Debadeepta Dey, Anh Truong, Sudipta Sinha, Shital Shah, Ashish Kapoor, Pat Hanrahan, Neel Joshi

Abstract:Drones equipped with cameras are emerging as a powerful tool for large-scale aerial 3D scanning, but existing automatic flight planners do not exploit all available information about the scene, and can therefore produce inaccurate and incomplete 3D models. We present an automatic method to generate drone trajectories, such that the imagery acquired during the flight will later produce a high-fidelity 3D model. Our method uses a coarse estimate of the scene geometry to plan camera trajectories that: (1) cover the scene as thoroughly as possible; (2) encourage observations of scene geometry from a diverse set of viewing angles; (3) avoid obstacles; and (4) respect a user-specified flight time budget. Our method relies on a mathematical model of scene coverage that exhibits an intuitive diminishing returns property known as submodularity. We leverage this property extensively to design a trajectory planning algorithm that reasons globally about the non-additive coverage reward obtained across a trajectory, jointly with the cost of traveling between views. We evaluate our method by using it to scan three large outdoor scenes, and we perform a quantitative evaluation using a photorealistic video game simulator.
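The diminishing returns property mentioned above can be seen in a toy set-coverage model. The sketch below picks camera views greedily by marginal coverage gain under a simple view-count budget. This is a textbook greedy heuristic for submodular coverage, not the paper's planner: it ignores travel cost, obstacles, and viewing angles, and the view names and visibility sets are made up for this example.

```python
def coverage(views, visible):
    """Number of distinct surface patches seen by at least one chosen view."""
    covered = set()
    for v in views:
        covered |= visible[v]
    return len(covered)

def greedy_select(visible, budget):
    """Repeatedly add the view with the largest marginal coverage gain."""
    chosen = []
    while len(chosen) < budget:
        base = coverage(chosen, visible)
        best, best_gain = None, 0
        for v in sorted(visible):  # sorted for deterministic tie-breaking
            if v in chosen:
                continue
            gain = coverage(chosen + [v], visible) - base
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no remaining view adds coverage
            break
        chosen.append(best)
    return chosen

# Hypothetical candidate views and the patch ids each one observes.
visible = {
    "north": {1, 2, 3, 4},
    "east": {3, 4, 5},
    "south": {5, 6},
    "top": {1, 2, 3, 4, 5},
}

picked = greedy_select(visible, budget=2)

# Diminishing returns: "east" adds less once "north" is already chosen.
gain_alone = coverage(["east"], visible) - coverage([], visible)
gain_after_north = coverage(["north", "east"], visible) - coverage(["north"], visible)
print(picked, gain_alone, gain_after_north)
```

Because the reward is non-additive, the value of a view depends on what has already been chosen, which is why the planner has to reason about the trajectory as a whole rather than score views independently.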

* Accepted for publication at the International Conference on Computer Vision (ICCV) 2017; Supplementary video: http://www.youtube.com/watch?v=89fFmfVZSO8
