Abstract: Swarm perception refers to the ability of a robot swarm to utilize the perception capabilities of each individual robot, forming a collective understanding of the environment. Their distributed nature enables robot swarms to continuously monitor dynamic environments by maintaining a constant presence throughout the space. In this study, we present a preliminary experiment on the collective tracking of people using a robot swarm. The experiment was conducted in simulation across four different office environments, with swarms of varying sizes. The robots were provided with images sampled from a dataset of real-world office environment pictures. We measured the time distribution required for a robot to detect a person changing location and to propagate this information to increasing fractions of the swarm. The results indicate that robot swarms show significant promise in monitoring dynamic environments.
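A minimal sketch of the measurement idea described above, assuming a simple gossip-style propagation between nearby robots (all parameters, the static robot positions, and the communication model are hypothetical; this is not the authors' simulator). It records the first time step at which increasing fractions of the swarm have received a detection event.

import random

def propagation_times(num_robots=20, comm_range=2.0, arena=10.0, steps=500):
    # Random robot positions in a square arena (illustrative only; robots are static here).
    pos = [(random.uniform(0, arena), random.uniform(0, arena)) for _ in range(num_robots)]
    informed = {0}                      # robot 0 detects the person moving at t = 0
    reach_time = {}                     # swarm fraction -> first step it was reached
    for t in range(1, steps + 1):
        newly = set()
        for i in informed:
            for j in range(num_robots):
                if j not in informed:
                    dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                    if (dx * dx + dy * dy) ** 0.5 <= comm_range:
                        newly.add(j)    # j is within range of an informed robot
        informed |= newly
        frac = len(informed) / num_robots
        for target in (0.25, 0.5, 0.75, 1.0):
            if frac >= target and target not in reach_time:
                reach_time[target] = t
        if len(informed) == num_robots:
            break
    return reach_time

print(propagation_times())

Repeating such a simulation over many random configurations yields the kind of time distribution per swarm fraction that the study reports.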
Abstract: Despite the recent advances in computer vision research, estimating the 3D human pose from single RGB images remains a challenging task, as multiple 3D poses can correspond to the same 2D projection on the image. In this context, depth data could help to disambiguate the 2D information by providing additional constraints about the distance between objects in the scene and the camera. Unfortunately, the acquisition of accurate depth data is limited to indoor spaces and is usually tied to specific depth technologies and devices, thus limiting generalization capabilities. In this paper, we propose a method able to leverage the benefits of depth information without compromising its broader applicability and adaptability in a predominantly RGB-camera-centric landscape. Our approach consists of a heatmap-based 3D pose estimator that, leveraging the paradigm of Privileged Information, is able to hallucinate depth information from the RGB frames given at inference time. More precisely, depth information is used exclusively during training by encouraging our RGB-based hallucination network to learn features similar to those of a backbone pre-trained only on depth data. This approach proves effective even when dealing with small, limited datasets. Experimental results reveal that the paradigm of Privileged Information significantly enhances the model's performance, enabling the efficient extraction of depth-related information using only RGB images.
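A minimal PyTorch-style sketch of the Privileged Information idea described above (module names, shapes, and the loss weighting are illustrative assumptions, not the authors' implementation): an RGB "hallucination" branch is trained so that its features mimic those of a backbone pre-trained on depth, which is frozen and only used during training.

import torch
import torch.nn as nn

rgb_backbone   = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())   # hallucination net
depth_backbone = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())   # pre-trained on depth
depth_backbone.requires_grad_(False)                                        # frozen teacher

pose_head = nn.Conv2d(64, 17, 1)        # heatmap head, e.g. 17 joints (hypothetical)
mse = nn.MSELoss()

def training_step(rgb, depth, gt_heatmaps, alpha=0.5):
    feat_rgb = rgb_backbone(rgb)                   # features hallucinated from RGB
    with torch.no_grad():
        feat_depth = depth_backbone(depth)         # privileged depth features
    hallucination_loss = mse(feat_rgb, feat_depth) # align RGB features with depth features
    pose_loss = mse(pose_head(feat_rgb), gt_heatmaps)
    return pose_loss + alpha * hallucination_loss

At inference time only the RGB branch is used, e.g. heatmaps = pose_head(rgb_backbone(rgb)), so no depth sensor is required.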
Abstract: Most studies in swarm robotics treat the swarm as an isolated system of interest. We argue that the prevailing view of swarms as self-sufficient, independent systems limits the scope of potential applications for swarm robotics. A robot swarm could act as a support in a heterogeneous system comprising other robots and/or human operators, in particular by quickly providing access to a large amount of data acquired in large unknown environments. Tasks such as target identification and tracking, scouting, or monitoring/surveillance could benefit from this approach.
Abstract: Skeleton-based action segmentation requires recognizing composable actions in untrimmed videos. Current approaches decouple this problem by first extracting local visual features from skeleton sequences and then processing them with a temporal model to classify frame-wise actions. However, their performance remains limited, as the visual features cannot sufficiently express composable actions. In this context, we propose Latent Action Composition (LAC), a novel self-supervised framework aiming at learning from synthesized composable motions for skeleton-based action segmentation. LAC comprises a novel generation module for synthesizing new sequences. Specifically, we design a linear latent space in the generator to represent primitive motion. New composed motions can be synthesized by simply performing arithmetic operations on the latent representations of multiple input skeleton sequences. LAC leverages such synthesized sequences, which have large diversity and complexity, to learn visual representations of skeletons in both the sequence and frame spaces via contrastive learning. The resulting visual encoder has high expressive power and can be effectively transferred to action segmentation tasks by end-to-end fine-tuning, without the need for additional temporal models. We conduct a study focusing on transfer learning and show that representations learned by pre-trained LAC outperform the state of the art by a large margin on the TSU, Charades, and PKU-MMD datasets.
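An illustrative sketch of the latent composition step (not the released LAC code; the encoder/decoder, dimensions, and mixing weight are hypothetical): two skeleton sequences are encoded, their latent codes are combined arithmetically, and the decoder synthesizes a new composed sequence.

import torch
import torch.nn as nn

latent_dim, seq_len, num_joints = 128, 64, 17    # hypothetical sizes

encoder = nn.Linear(seq_len * num_joints * 3, latent_dim)
decoder = nn.Linear(latent_dim, seq_len * num_joints * 3)

def compose(seq_a, seq_b, w=0.5):
    # seq_*: (seq_len, num_joints, 3) skeleton sequences
    z_a = encoder(seq_a.flatten())
    z_b = encoder(seq_b.flatten())
    z_new = w * z_a + (1.0 - w) * z_b            # arithmetic on latent codes
    return decoder(z_new).view(seq_len, num_joints, 3)

composed = compose(torch.randn(seq_len, num_joints, 3),
                   torch.randn(seq_len, num_joints, 3))
print(composed.shape)   # torch.Size([64, 17, 3])

The composed sequences serve as additional, more complex training data for the contrastive pre-training described above.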
Abstract: Automatic off-line design is an attractive approach to implementing robot swarms. In this approach, a designer specifies a mission for the swarm, and an optimization process generates suitable control software for the individual robots through computer-based simulations. Most of the relevant literature has focused on effectively transferring control software from simulation to physical robots. For the first time, we investigate (i) whether control software generated via automatic design is transferable across robot platforms and (ii) whether the design methods that generate such control software are themselves transferable. We experiment with two ground mobile platforms with equivalent capabilities. Our measure of transferability is based on the performance drop observed when control software and/or design methods are ported from one platform to another. Results indicate that while the control software generated via automatic design is transferable in some cases, better performance can be achieved when a transferable method is directly applied to the new platform.
Abstract: Self-supervised video representation learning typically aims at maximizing the similarity between different temporal segments of one video in order to enforce feature persistence over time. This leads to a loss of pertinent information related to temporal relationships, rendering actions such as `enter' and `leave' indistinguishable. To mitigate this limitation, we propose Latent Time Navigation (LTN), a time-parameterized contrastive learning strategy that is streamlined to capture fine-grained motions. Specifically, we maximize the representation similarity between different segments of one video while keeping their representations time-aware along a subspace of the latent code that includes an orthogonal basis representing temporal changes. Our extensive experimental analysis suggests that learning video representations with LTN consistently improves the performance of action classification in fine-grained and human-oriented tasks (e.g., on the Toyota Smarthome dataset). In addition, we demonstrate that our proposed model, when pre-trained on Kinetics-400, generalizes well to the unseen real-world video benchmarks UCF101 and HMDB51, achieving state-of-the-art performance in action recognition.
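A rough sketch of the time-parameterized idea as we describe it above (our simplified reading, not the authors' code; the basis construction, the time-dependent offset, and all dimensions are assumptions): each clip embedding is shifted along an orthonormal basis according to its timestamp before a contrastive loss encourages segments of the same video to agree.

import torch
import torch.nn.functional as F

dim, num_basis = 256, 8
basis, _ = torch.linalg.qr(torch.randn(dim, num_basis))   # orthonormal time basis (assumed fixed here)

def time_aware(z, t):
    # z: (batch, dim) clip embeddings; t: (batch,) normalized timestamps in [0, 1]
    shift = t.unsqueeze(1) * basis.sum(dim=1)              # simple time-dependent offset
    return F.normalize(z + shift, dim=1)

def contrastive_loss(z1, t1, z2, t2, temperature=0.1):
    a, b = time_aware(z1, t1), time_aware(z2, t2)
    logits = a @ b.t() / temperature                       # positives on the diagonal
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)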
Abstract: Video anomaly detection in surveillance systems with only video-level labels (i.e., weakly supervised) is challenging. This is due to (i) the complex integration of human- and scene-based anomalies, comprising subtle and sharp spatio-temporal cues in real-world scenarios, and (ii) suboptimal optimization between normal and anomalous instances under weak supervision. In this paper, we propose a Human-Scene Network to learn discriminative representations by capturing both subtle and strong cues in a dissociative manner. In addition, we propose a self-rectifying loss that dynamically computes pseudo temporal annotations from video-level labels for optimizing the Human-Scene Network effectively. The proposed Human-Scene Network optimized with the self-rectifying loss is validated on three publicly available datasets, i.e., UCF-Crime, ShanghaiTech, and IITB-Corridor, outperforming recently reported state-of-the-art approaches on five out of the six scenarios considered.
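A hedged sketch of deriving pseudo temporal annotations from video-level labels, in the spirit of the self-rectifying loss mentioned above (the exact formulation is the authors'; this only illustrates a generic top-k pseudo-labelling scheme, with k and the segment count chosen arbitrarily).

import torch

def pseudo_temporal_labels(segment_scores, video_label, k=3):
    # segment_scores: (num_segments,) anomaly scores predicted for one video
    # video_label: 1 if the video is labelled anomalous, 0 if normal
    labels = torch.zeros_like(segment_scores)
    if video_label == 1:
        topk = torch.topk(segment_scores, k).indices   # most anomalous segments
        labels[topk] = 1.0                             # treated as pseudo positives
    return labels                                      # normal videos stay all-zero

scores = torch.rand(32)
print(pseudo_temporal_labels(scores, video_label=1))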
Abstract: Object pose estimation is crucial for robotic applications and augmented reality. Beyond instance-level 6D object pose estimation methods, estimating category-level pose and shape has become a promising trend. As such, this new research field needs to be supported by well-designed datasets. To provide a benchmark with high-quality ground-truth annotations to the community, we introduce PhoCaL, a multimodal dataset for category-level object pose estimation with photometrically challenging objects. PhoCaL comprises 60 high-quality 3D models of household objects across 8 categories, including highly reflective, transparent, and symmetric objects. We developed a novel robot-supported multimodal (RGB, depth, polarisation) data acquisition and annotation process. It ensures sub-millimeter pose accuracy for opaque textured, shiny, and transparent objects, no motion blur, and perfect camera synchronisation. To set a benchmark for our dataset, state-of-the-art RGB-D and monocular RGB methods are evaluated on the challenging scenes of PhoCaL.
Abstract: The ability to successfully grasp objects is crucial in robotics, as it enables several interactive downstream applications. To this end, most approaches either compute the full 6D pose of the object of interest or learn to predict a set of grasping points. While the former approaches do not yet scale well to multiple object instances or classes, the latter require large annotated datasets and are hampered by poor generalization to new geometries. To overcome these shortcomings, we propose to teach a robot how to grasp an object with a simple and short human demonstration. Hence, our approach neither requires many annotated images nor is it restricted to a specific geometry. We first present a small sequence of RGB-D images displaying a human-object interaction. This sequence is then leveraged to build the associated hand and object meshes that represent the depicted interaction. Subsequently, we complete the missing parts of the reconstructed object shape and estimate the relative transformation between the reconstruction and the visible object in the scene. Finally, we combine the a-priori knowledge of the relative pose between the object and the human hand with the estimate of the current object pose in the scene to derive the grasping instructions for the robot. Exhaustive evaluations with Toyota's Human Support Robot (HSR) in real and synthetic environments demonstrate the applicability of our proposed methodology and its advantages over previous approaches.
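A minimal sketch of the final transfer step (illustrative, not the paper's code; the numerical values are arbitrary examples): the grasp pose recorded relative to the object during the demonstration is re-expressed in the current scene by composing it with the estimated object pose.

import numpy as np

def transfer_grasp(T_object_now, T_grasp_in_object):
    # Both arguments are 4x4 homogeneous transforms:
    #   T_object_now      - estimated pose of the object in the current scene
    #   T_grasp_in_object - hand/gripper pose relative to the object (from the demo)
    return T_object_now @ T_grasp_in_object     # gripper pose in the scene frame

T_object_now = np.eye(4)
T_object_now[:3, 3] = [0.4, 0.1, 0.8]           # example: object 0.4 m in front of the robot
T_grasp_in_object = np.eye(4)
T_grasp_in_object[:3, 3] = [0.0, 0.0, 0.05]     # example: grasp 5 cm above the object origin
print(transfer_grasp(T_object_now, T_grasp_in_object))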
Abstract: Action recognition based on skeleton data has recently witnessed increasing attention and progress. State-of-the-art approaches adopting Graph Convolutional Networks (GCNs) can effectively extract features from human skeletons by relying on a pre-defined human topology. Despite this progress, GCN-based methods have difficulty generalizing across domains, especially across different human topological structures. In this context, we introduce UNIK, a novel skeleton-based action recognition method that is not only effective at learning spatio-temporal features on human skeleton sequences but is also able to generalize across datasets. This is achieved by learning an optimal dependency matrix, initialized from the uniform distribution, via a multi-head attention mechanism. Subsequently, to study the cross-domain generalizability of skeleton-based action recognition in real-world videos, we re-evaluate state-of-the-art approaches, as well as the proposed UNIK, on a novel Posetics dataset. This dataset is created from Kinetics-400 videos by estimating, refining, and filtering poses. We provide an analysis of how much performance improves on smaller benchmark datasets after pre-training on Posetics for the action classification task. Experimental results show that the proposed UNIK, with pre-training on Posetics, generalizes well and outperforms the state of the art when transferred onto four target action classification datasets: Toyota Smarthome, Penn Action, NTU-RGB+D 60, and NTU-RGB+D 120.
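A simplified sketch of replacing a pre-defined skeleton graph with a learned dependency matrix, as described above (not the released UNIK code; the module, head fusion, and dimensions are assumptions): per-head dependency weights are initialized from a uniform distribution and refined during training, so no fixed human topology is assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedDependency(nn.Module):
    def __init__(self, num_joints=17, num_heads=4, dim=64):
        super().__init__()
        # Uniformly initialized dependency matrices, one per attention head.
        self.dep = nn.Parameter(torch.rand(num_heads, num_joints, num_joints))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, num_joints, dim) joint features for one frame
        attn = F.softmax(self.dep, dim=-1)            # normalized dependencies
        heads = [a @ x for a in attn]                 # aggregate joint features per head
        return self.proj(torch.stack(heads).mean(0))  # fuse heads

out = LearnedDependency()(torch.randn(2, 17, 64))
print(out.shape)   # torch.Size([2, 17, 64])

Because the learned matrices are not tied to one dataset's joint layout, the same encoder can be pre-trained on Posetics and transferred to skeletons from other sources.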