Kenji Hata

Real-time Autonomous Control of a Continuous Macroscopic Process as Demonstrated by Plastic Forming

Dec 14, 2023

Shun Muroga, Takashi Honda, Yasuaki Miki, Hideaki Nakajima, Don N. Futaba, Kenji Hata

Abstract: To meet the demands for more adaptable and expedient approaches to augment both research and manufacturing, we report an autonomous system using real-time in-situ characterization and an autonomous decision-making processor based on an active learning algorithm. This system was applied to a plastic film forming system to highlight its efficiency and accuracy in determining the process conditions for specified target film dimensions, importantly, without any human intervention. Application of this system to nine distinct film dimensions demonstrated the system's ability to quickly determine appropriate and stable process conditions (on average 11 characterization-adjustment iterations, 19 minutes) and to avoid traps such as repetitive over-correction. Furthermore, comparison of the achieved film dimensions to the target values showed high accuracy (R² = 0.87 and 0.90 for film width and thickness, respectively). In addition, the use of an active learning algorithm allowed our system to proceed with optimization with zero initial training data, which was unavailable due to the complex relationships between the control factors (material supply rate, applied force, material viscosity) within the plastic forming process. As our system is intrinsically general and can be applied to most material processes, these results have significant implications for accelerating both research and industrial processes.

* 18 pages, 7 figures
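
The abstract above reports accuracy as an R² between achieved and target film dimensions. As a rough illustration only, the sketch below computes a coefficient of determination as 1 - SS_res / SS_tot, treating the achieved values as observations and the targets as predictions. That definition is an assumption (the paper may use a different one, such as a squared correlation), and the function name `r_squared` and the sample numbers are hypothetical, not taken from the paper.

```python
from typing import Sequence


def r_squared(achieved: Sequence[float], target: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot.

    Assumption: achieved values are the observations and target values
    play the role of predictions; the paper may define R2 differently.
    """
    if len(achieved) != len(target) or len(achieved) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_achieved = sum(achieved) / len(achieved)
    # Residual sum of squares: how far each achieved value is from its target.
    ss_res = sum((a - t) ** 2 for a, t in zip(achieved, target))
    # Total sum of squares: spread of the achieved values around their mean.
    ss_tot = sum((a - mean_achieved) ** 2 for a in achieved)
    if ss_tot == 0:
        raise ValueError("achieved values are all identical; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical film widths in millimetres, not data from the paper.
target_width = [10.0, 20.0, 30.0]
achieved_width = [10.0, 20.0, 40.0]
print(round(r_squared(achieved_width, target_width), 4))
```

A value of 1.0 means every achieved dimension matched its target exactly; lower values indicate larger deviations relative to the spread of the achieved values.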

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Dec 05, 2023

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

Abstract: Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.

Learning to Detect Touches on Cluttered Tables

Apr 10, 2023

Norberto Adrian Goussies, Kenji Hata, Shruthi Prabhakara, Abhishek Amit, Tony Aube, Carl Cepress, Diana Chang, Li-Te Cheng, Horia Stefan Ciurdar, Mike Cleron(+21 more)

Abstract: We present a novel self-contained camera-projector tabletop system with a lamp form-factor that brings digital intelligence to our tables. We propose a real-time, on-device, learning-based touch detection algorithm that makes any tabletop interactive. The top-down configuration and learning-based algorithm make our method robust to the presence of clutter, a main limitation of existing camera-projector tabletop systems. Our research prototype enables a set of experiences that combine hand interactions and objects present on the table. A video can be found at https://youtu.be/hElC_c25Fg8.

A Comprehensive and Versatile Multimodal Deep Learning Approach for Predicting Diverse Properties of Advanced Materials

Mar 29, 2023

Shun Muroga, Yasuaki Miki, Kenji Hata

Abstract: We present a multimodal deep learning (MDL) framework for predicting physical properties of a 10-dimensional acrylic polymer composite material by merging physical attributes and chemical data. Our MDL model comprises four modules, including three generative deep learning models for material structure characterization and a fourth model for property prediction. Our approach handles an 18-dimensional complexity, with 10 compositional inputs and 8 property outputs, successfully predicting 913,680 property data points across 114,210 composition conditions. This level of complexity is unprecedented in computational materials science, particularly for materials with undefined structures. We propose a framework to analyze the high-dimensional information space for inverse material design, demonstrating flexibility and adaptability to various materials and scales, provided sufficient data is available. This study advances future research on different materials and the development of more sophisticated models, drawing us closer to the ultimate goal of predicting all properties of all materials.

* 38 pages, 17 figures, 1 table

Agile Modeling: Image Classification with Domain Experts in the Loop

Feb 25, 2023

Otilia Stretcu, Edward Vendrow, Kenji Hata, Krishnamurthy Viswanathan, Vittorio Ferrari, Sasan Tavakkol, Wenlei Zhou, Aditya Avinash, Enming Luo, Neil Gordon Alldrin(+6 more)

Abstract: Machine learning is not readily accessible to domain experts from many fields, blocked by issues ranging from data mining to model training. We argue that domain experts should be at the center of the modeling process, and we introduce the "Agile Modeling" problem: the process of turning any visual concept from an idea into a well-trained ML classifier through a human-in-the-loop interaction driven by the domain expert in a way that minimizes domain expert time. We propose a solution to the problem that enables domain experts to create classifiers in real time and build upon recent advances in image-text co-embeddings such as CLIP or ALIGN to implement it. We show the feasibility of this solution through live experiments with 14 domain experts, each modeling their own concept. Finally, we compare a domain-expert-driven process with the traditional crowdsourcing paradigm and find that difficult concepts see pronounced improvements with domain experts.

Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation

Nov 26, 2019

Zeyu Wang, Klint Qinami, Yannis Karakozis, Kyle Genova, Prem Nair, Kenji Hata, Olga Russakovsky

Abstract: Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.

ActivityNet Challenge 2017 Summary

Oct 22, 2017

Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Ranjay Khrisna, Victor Escorcia, Kenji Hata, Shyamal Buch

Abstract: The ActivityNet Large Scale Activity Recognition Challenge 2017 Summary: results and challenge participants' papers.

* 76 pages

Dense-Captioning Events in Videos

May 02, 2017

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles

Abstract: Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short events and long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.

* 16 pages, 16 figures

A Glimpse Far into the Future: Understanding Long-term Crowd Worker Quality

Nov 01, 2016

Kenji Hata, Ranjay Krishna, Li Fei-Fei, Michael S. Bernstein

Abstract: Microtask crowdsourcing is increasingly critical to the creation of extremely large datasets. As a result, crowd workers spend weeks or months repeating the exact same tasks, making it necessary to understand their behavior over these long periods of time. We utilize three large, longitudinal datasets of nine million annotations collected from Amazon Mechanical Turk to examine claims that workers fatigue or satisfice over these long periods, producing lower quality work. We find that, contrary to these claims, workers are extremely stable in their quality over the entire period. To understand whether workers set their quality based on the task's requirements for acceptance, we then perform an experiment where we vary the required quality for a large crowdsourcing task. Workers did not adjust their quality based on the acceptance threshold: workers who were above the threshold continued working at their usual quality level, and workers below the threshold self-selected themselves out of the task. Capitalizing on this consistency, we demonstrate that it is possible to predict workers' long-term quality using just a glimpse of their quality on the first five tasks.

* 10 pages, 11 figures, accepted CSCW 2017

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Feb 23, 2016

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma(+2 more)

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.

* 44 pages, 37 figures
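
The abstract above writes relationships as predicate(subject, object) pairs, such as riding(man, carriage) and pulling(horse, carriage). A minimal sketch of how such triples could be stored and queried follows. The `Relationship` class and the two helper functions are hypothetical names chosen for illustration; they are not the Visual Genome data format or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relationship:
    """One predicate(subject, obj) triple; an illustrative structure only."""
    predicate: str
    subject: str
    obj: str


def objects_of(scene: List[Relationship], predicate: str, subject: str) -> List[str]:
    """Objects X such that predicate(subject, X) holds in the scene."""
    return [r.obj for r in scene if r.predicate == predicate and r.subject == subject]


def subjects_of(scene: List[Relationship], predicate: str, obj: str) -> List[str]:
    """Subjects Y such that predicate(Y, obj) holds in the scene."""
    return [r.subject for r in scene if r.predicate == predicate and r.obj == obj]


# The two relationships named in the abstract.
scene = [
    Relationship("riding", "man", "carriage"),
    Relationship("pulling", "horse", "carriage"),
]

# "What vehicle is the person riding?": find what the man rides,
# then look up what pulls that vehicle.
for vehicle in objects_of(scene, "riding", "man"):
    pullers = subjects_of(scene, "pulling", vehicle)
    print(vehicle, "pulled by", pullers)
```

Chaining the two lookups is what lets the answer mention both the carriage and the horse, which neither relationship provides on its own.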
