Abstract: The rapid development of object detection techniques has drawn attention to the design of efficient Deep Neural Networks (DNNs). However, current state-of-the-art DNN models cannot provide a balanced solution among accuracy, speed, and model size. This paper proposes an efficient real-time object detection framework for resource-constrained hardware devices through hardware-software co-design. Tensor Train (TT) decomposition is proposed for compressing the YOLOv5 model. By utilizing the unique characteristics offered by TT decomposition, we develop an efficient hardware accelerator based on FPGA devices. Experimental results show that the proposed method significantly reduces model size and improves execution time.
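The abstract does not detail the compression step, but the core of TT decomposition can be sketched: a reshaped weight tensor is factorized into a chain of small three-way cores by sequential truncated SVDs (the standard TT-SVD algorithm). The function below is a minimal illustrative sketch under that assumption, not the paper's implementation; the example tensor shape and rank are made up.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Factor a d-way tensor into Tensor Train cores via sequential
    truncated SVDs (TT-SVD). Each core has shape (r_prev, n_k, r_next)."""
    shape = tensor.shape
    cores, rank = [], 1
    mat = tensor.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))  # truncate to the target TT-rank
        cores.append(u[:, :r].reshape(rank, shape[k], r))
        mat = (s[:r, None] * vt[:r]).reshape(r * shape[k + 1], -1)
        rank = r
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores

# Example: a conv weight reshaped to 4 modes; the TT cores store far
# fewer parameters than the dense tensor when the ranks are small.
w = np.random.randn(64, 3, 3, 64)
cores = tt_svd(w, max_rank=16)
print(sum(c.size for c in cores), "TT parameters vs.", w.size, "dense")
```

The parameter saving is exactly what makes the format attractive on FPGAs: inference can contract the small cores on the fly instead of streaming a large dense weight tensor from memory.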
Abstract: This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, and to offer nuanced video content comprehension across various video lengths. This technical report provides an overview of Pegasus-1's architecture, training strategies, and performance on benchmarks for video conversation, zero-shot video question answering, and video summarization. We also explore the qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, in order to provide readers with a balanced view of its current state and future direction.
Abstract: The simulation of large-scale systems with complex electron interactions remains one of the greatest challenges for the atomistic modeling of materials. Whereas classical force fields often fail to describe the coupling between electronic states and ionic rearrangements, the more accurate \textit{ab-initio} molecular dynamics suffers from a computational complexity that prevents the long-time and large-scale simulations essential to studying many technologically relevant phenomena, such as reactions, ion migrations, phase transformations, and degradation. In this work, we present the Crystal Hamiltonian Graph neural Network (CHGNet) as a novel machine-learning interatomic potential (MLIP), using a graph-neural-network-based force field to model a universal potential energy surface. CHGNet is pretrained on the energies, forces, stresses, and magnetic moments from the Materials Project Trajectory Dataset, which consists of over 10 years of density functional theory static and relaxation trajectories of $\sim 1.5$ million inorganic structures. The explicit inclusion of magnetic moments enables CHGNet to learn and accurately represent the orbital occupancy of electrons, enhancing its capability to describe both atomic and electronic degrees of freedom. We demonstrate several applications of CHGNet in solid-state materials, including charge-informed molecular dynamics in Li$_x$MnO$_2$, the finite temperature phase diagram for Li$_x$FePO$_4$, and Li diffusion in garnet conductors. We critically analyze the significance of including charge information for capturing appropriate chemistry, and we provide new insights into ionic systems with additional electronic degrees of freedom that cannot be observed by previous MLIPs.
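The abstract's training signal, energies, forces, stresses, and magnetic moments, maps onto a standard multi-target MLIP setup in which forces come from the negative gradient of the predicted energy with respect to atomic positions. Below is a minimal PyTorch sketch of that general construction; the model interface, dictionary keys, and loss weights are hypothetical, not CHGNet's actual code.

```python
import torch
import torch.nn.functional as F

def energy_and_forces(model, positions, atom_feats):
    """Predict total energy, then obtain forces as -dE/dR via autograd,
    the standard construction for machine-learning interatomic potentials."""
    positions = positions.clone().requires_grad_(True)
    energy = model(positions, atom_feats)  # scalar total energy
    (grad,) = torch.autograd.grad(energy, positions, create_graph=True)
    return energy, -grad

def mlip_loss(pred, ref, w_e=1.0, w_f=1.0, w_s=0.1, w_m=0.1):
    """Weighted loss over energy (e), forces (f), stress (s), and magnetic
    moments (m), mirroring the four label types in the training dataset."""
    return (w_e * F.mse_loss(pred["e"], ref["e"])
            + w_f * F.mse_loss(pred["f"], ref["f"])
            + w_s * F.mse_loss(pred["s"], ref["s"])
            + w_m * F.mse_loss(pred["m"], ref["m"]))
```

Deriving forces by differentiation rather than predicting them directly guarantees they are consistent with the learned energy surface, which matters for stable molecular dynamics.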
Abstract: To adhere to the stringent time and budget requirements of construction projects, contractors are adopting prefabricated construction methods to expedite the construction process. To succeed, these methods require an adequate schedule and a thorough understanding by the contractors and constructors. The specificity of prefabricated construction often leads to inefficient scheduling and costly rework. The designer, contractor, and constructors must have a strong understanding of the assembly process to realize the full benefits of the method, and at the root of that understanding is visualizing how the process is intended to be performed. Currently, a virtual construction model is used to explain and better visualize the construction process; however, creating such a model is time-consuming and requires experienced personnel. The proposed simulation of the virtual assembly increases the automation of virtual construction modeling by leveraging the data available in a building information modeling (BIM) model. This paper presents various factors (e.g., formalization of the construction sequence based on the level of development (LOD)) that need to be addressed for the development of automated virtual assembly. Two case studies are presented to demonstrate these factors.
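As a concrete illustration of pulling assembly-relevant data out of a BIM model, the sketch below reads an IFC export with the open-source ifcopenshell library and orders elements into a candidate assembly sequence. The file name and the idea of encoding the sequence index in the element name are assumptions for illustration; a real workflow would use a property set agreed on with the designer.

```python
import ifcopenshell

# Load a BIM model exported to IFC (file name is hypothetical).
model = ifcopenshell.open("prefab_building.ifc")

# Collect building elements; which types appear depends on the model's LOD.
elements = model.by_type("IfcBuildingElement")

# Order elements into a candidate assembly sequence. Here we assume the
# sequence is recoverable from each element's Name attribute.
sequence = sorted(elements, key=lambda e: (e.Name or ""))
for step, e in enumerate(sequence, 1):
    print(step, e.is_a(), e.GlobalId)
```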
Abstract: Research studies have shown that a large proportion of hazards remain unrecognized, exposing construction workers to unanticipated safety risks. Recent studies have also found a strong correlation between workers' viewing patterns, captured using eye-tracking devices, and their hazard recognition performance. It is therefore important to analyze workers' viewing patterns to gain a better understanding of their hazard recognition performance. This paper proposes a method that automatically maps gaze fixations collected with a wearable eye-tracker to predefined areas of interest. The proposed method detects these areas or objects of interest (i.e., hazards) through a computer-vision-based segmentation technique and transfer learning. The mapped fixation data are then used to analyze workers' viewing behaviors and compute their attention distribution. The method is implemented on a road under construction as a case study to evaluate its performance.
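The fixation-to-AOI mapping step can be sketched compactly: each gaze point is looked up in the per-pixel class mask produced by the segmentation model, and dwell counts per class give the attention distribution. The class names and toy mask below are illustrative, not the study's data.

```python
import numpy as np

def attention_distribution(fixations, label_mask, class_names):
    """Map gaze fixations (pixel coordinates) onto a semantic segmentation
    mask and return the share of fixations falling on each class (AOI)."""
    counts = np.zeros(len(class_names), dtype=int)
    h, w = label_mask.shape
    for x, y in fixations:
        if 0 <= x < w and 0 <= y < h:  # ignore fixations outside the frame
            counts[label_mask[int(y), int(x)]] += 1
    total = max(counts.sum(), 1)
    return {name: counts[i] / total for i, name in enumerate(class_names)}

# Toy example: a 480x640 mask whose right half is class 2, and two fixations.
mask = np.zeros((480, 640), dtype=int)
mask[:, 320:] = 2
print(attention_distribution([(100, 200), (500, 200)], mask,
                             ["background", "worker", "excavator"]))
```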
Abstract: Camera-equipped unmanned vehicles (UVs) have received considerable attention for data collection in construction monitoring applications. To operate autonomously, a UV must run multiple modules (e.g., context-awareness, control, localization, and mapping) on an embedded platform. Pixel-wise semantic segmentation gives a UV the ability to be contextually aware of its surrounding environment. However, for mobile robotic systems with limited computing resources, the large size and high memory usage of segmentation models demand computing power that a mobile UV (e.g., a small-scale vehicle with limited payload and space) cannot provide, which is a major challenge. To overcome this challenge, this paper presents a lightweight and efficient deep neural network architecture that runs on an embedded platform in real time. The proposed model segments navigable space in an image sequence (i.e., a video stream), which is essential for a vision-based autonomous vehicle. The results demonstrate the efficiency of the proposed architecture compared to existing models and suggest possible improvements that could make the model even more efficient, which is necessary for the future development of autonomous robotic systems.
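The abstract does not specify the architecture, but the standard route to a lightweight segmentation model is depthwise-separable convolutions inside a small encoder-decoder. The sketch below illustrates that design pattern only; it is not the paper's network.

```python
import torch
import torch.nn as nn

def ds_conv(c_in, c_out, stride=1):
    """Depthwise-separable convolution: the usual trick for cutting
    parameters and FLOPs in embedded segmentation models."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TinySegNet(nn.Module):
    """Illustrative encoder-decoder for binary navigable-space segmentation."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(ds_conv(3, 16, 2), ds_conv(16, 32, 2),
                                 ds_conv(32, 64, 2))
        self.head = nn.Conv2d(64, 1, 1)  # 1 channel: navigable vs. not

    def forward(self, x):
        logits = self.head(self.enc(x))
        # Upsample back to input resolution for pixel-wise prediction.
        return nn.functional.interpolate(logits, size=x.shape[2:],
                                         mode="bilinear", align_corners=False)

# One 640x360 video frame passes through in a single forward call.
out = TinySegNet()(torch.randn(1, 3, 360, 640))  # -> (1, 1, 360, 640)
```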
Abstract: Over the past few years, the use of camera-equipped robotic platforms for data collection and visual monitoring applications has grown exponentially. Cluttered construction sites with many objects (e.g., bricks and pipes) on the ground are challenging environments for a mobile unmanned ground vehicle (UGV) to navigate. To address this issue, this study presents a mobile UGV equipped with a stereo camera and a robotic arm that can remove obstacles along the UGV's path. The surrounding environment is captured by the stereo camera and obstacles are detected; each obstacle's location relative to the UGV is sent to the robotic arm module through the Robot Operating System (ROS), and the arm then picks up and removes the obstacle. The proposed method will greatly enhance the degree of automation and the frequency of data collection for construction monitoring. The system is validated through two case studies whose results successfully demonstrate the detection and removal of obstacles, serving as one of the enabling factors for developing an autonomous UGV for various construction operations.
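The pipeline the abstract describes, localizing a detected obstacle from stereo imagery and handing its position to the arm over ROS, can be sketched as follows. The pinhole-stereo back-projection (depth Z = fB/d) is standard; the camera intrinsics, topic name, and frame id are hypothetical.

```python
import rospy
from geometry_msgs.msg import PointStamped

# Stereo geometry: Z = f * B / d for focal length f (px), baseline B (m),
# and disparity d (px). The values below are hypothetical.
FOCAL_PX, BASELINE_M = 700.0, 0.12

def obstacle_position(u, v, disparity, cx=320.0, cy=240.0):
    """Back-project a detected obstacle's pixel (u, v) and disparity into
    camera-frame coordinates using the pinhole stereo model."""
    z = FOCAL_PX * BASELINE_M / disparity
    x = (u - cx) * z / FOCAL_PX
    y = (v - cy) * z / FOCAL_PX
    return x, y, z

rospy.init_node("obstacle_reporter")
pub = rospy.Publisher("/arm/obstacle_position", PointStamped, queue_size=1)
rospy.sleep(1.0)  # give subscribers time to connect

msg = PointStamped()
msg.header.stamp = rospy.Time.now()
msg.header.frame_id = "camera_link"
msg.point.x, msg.point.y, msg.point.z = obstacle_position(400, 260, 25.0)
pub.publish(msg)  # the arm module subscribes and plans a pick-and-remove
```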
Abstract: Unrecognized hazards substantially increase the likelihood of workplace fatalities and injuries. However, recent research has demonstrated that a large proportion of hazards remain unrecognized in dynamic construction environments. Recent studies have also suggested a strong correlation between workers' viewing patterns and their hazard recognition performance. Hence, it is important to study and analyze workers' viewing patterns to gain a better understanding of their hazard recognition performance. The objective of this exploratory research is to examine hazard recognition as a visual search process and to identify the visual search factors that affect it. The study also proposes a framework for developing a vision-based tool capable of recording and analyzing the viewing patterns of construction workers and generating feedback for personalized training and proactive safety management.
Abstract: One of the major challenges of a real-time autonomous robotic system for construction monitoring is to simultaneously localize, map, and navigate over the lifetime of the robot with little or no human intervention. Simultaneous Localization and Mapping (SLAM) and context-awareness are two active research areas in the computer vision and robotics communities, yet their real-time integration into a single modular framework for construction monitoring still needs further investigation. Monocular vision and real-time scene understanding are computationally heavy, and the major state-of-the-art algorithms are tested on high-end desktops and/or servers with high CPU and/or GPU computing capabilities, which limits their mobility and deployment in real-world applications. To address these challenges and achieve automation, this paper proposes an integrated robotic computer vision system that generates a real-world spatial map of the obstacles and traversable space present in the environment in near real time, by integrating contextual awareness and visual SLAM into a ground robotic agent. The paper reports the hardware utilization and performance of this system in three different outdoor environments, demonstrating the applicability of the pipeline to diverse outdoor scenes in near real time. The entire system is self-contained and requires no user input, which demonstrates the potential of this computer vision system for autonomous navigation.
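The fusion the abstract describes, combining per-frame contextual awareness (which pixels are traversable) with the SLAM pose to build a spatial map, can be sketched as an occupancy-grid update. The pose convention, intrinsics layout, and ground-plane assumption below are illustrative, not the system's actual code.

```python
import numpy as np

def update_grid(grid, traversable_mask, depth, pose, K, cell=0.1):
    """Fuse one frame into a 2D map: back-project pixels labeled traversable
    through the depth map, transform them with the SLAM camera pose, and
    mark the corresponding grid cells as free space."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    vs, us = np.nonzero(traversable_mask)      # pixel rows/cols labeled free
    z = depth[vs, us]
    # Homogeneous 3D points in the camera frame (pinhole back-projection).
    pts_cam = np.stack([(us - cx) * z / fx, (vs - cy) * z / fy,
                        z, np.ones_like(z)])
    pts_world = pose @ pts_cam                 # 4x4 SLAM pose, camera -> world
    gx = (pts_world[0] / cell).astype(int)
    gy = (pts_world[2] / cell).astype(int)     # assume ground plane is x-z
    ok = (0 <= gx) & (gx < grid.shape[1]) & (0 <= gy) & (gy < grid.shape[0])
    grid[gy[ok], gx[ok]] = 1                   # 1 = traversable, 0 = unknown
    return grid
```

Run per frame, this accumulates the traversable-space map as the robot moves, with the SLAM pose keeping successive frames registered in a common world frame.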