Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Eren Erdal Aksoy

REHEARSE-3D: A Multi-modal Emulated Rain Dataset for 3D Point Cloud De-raining

Apr 30, 2025

Abu Mohammed Raisuddin, Jesper Holmblad, Hamed Haghighi, Yuri Poledna, Maikol Funk Drechsler, Valentina Donzella, Eren Erdal Aksoy

Abstract:Sensor degradation poses a significant challenge in autonomous driving. During heavy rainfall, the interference from raindrops can adversely affect the quality of LiDAR point clouds, resulting in, for instance, inaccurate point measurements. This, in turn, can potentially lead to safety concerns if autonomous driving systems are not weather-aware, i.e., if they are unable to discern such changes. In this study, we release a new, large-scale, multi-modal emulated rain dataset, REHEARSE-3D, to promote research advancements in 3D point cloud de-raining. Distinct from the most relevant competitors, our dataset is unique in several respects. First, it is the largest point-wise annotated dataset, and second, it is the only one with high-resolution LiDAR data (LiDAR-256) enriched with 4D Radar point clouds logged in both daytime and nighttime conditions in a controlled weather environment. Furthermore, REHEARSE-3D involves rain-characteristic information, which is of significant value not only for sensor noise modeling but also for analyzing the impact of weather at a point level. Leveraging REHEARSE-3D, we benchmark raindrop detection and removal in fused LiDAR and 4D Radar point clouds. Our comprehensive study further evaluates the performance of various statistical and deep-learning models. Upon publication, the dataset and benchmark models will be made publicly available at: https://sporsho.github.io/REHEARSE3D.

Via

Access Paper or Ask Questions

Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder

Mar 15, 2025

Enes Erdogan, Eren Erdal Aksoy, Sanem Sariel

Abstract:Recognition of human manipulation actions in real-time is essential for safe and effective human-robot interaction and collaboration. The challenge lies in developing a model that is both lightweight enough for real-time execution and capable of generalization. While some existing methods in the literature can run in real-time, they struggle with temporal scalability, i.e., they fail to adapt to long-duration manipulations effectively. To address this, leveraging the generalizable scene graph representations, we propose a new Factorized Graph Sequence Encoder network that not only runs in real-time but also scales effectively in the temporal dimension, thanks to its factorized encoder architecture. Additionally, we introduce Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings. Our model outperforms the previous state-of-the-art real-time approach, achieving a 14.3\% and 5.6\% improvement in F1-macro score on the KIT Bimanual Action (Bimacs) Dataset and Collaborative Action (CoAx) Dataset, respectively. Moreover, we conduct an extensive ablation study to validate our network design choices. Finally, we compare our model with its architecturally similar RGB-based model on the Bimacs dataset and show the limitations of this model in contrast to ours on such an object-centric manipulation dataset.

* 8 pages, 3 figures, 7 tables

Via

Access Paper or Ask Questions

Semantics-aware LiDAR-Only Pseudo Point Cloud Generation for 3D Object Detection

Sep 16, 2023

Tiago Cortinhal, Idriss Gouigah, Eren Erdal Aksoy

Abstract:Although LiDAR sensors are crucial for autonomous systems due to providing precise depth information, they struggle with capturing fine object details, especially at a distance, due to sparse and non-uniform data. Recent advances introduced pseudo-LiDAR, i.e., synthetic dense point clouds, using additional modalities such as cameras to enhance 3D object detection. We present a novel LiDAR-only framework that augments raw scans with denser pseudo point clouds by solely relying on LiDAR sensors and scene semantics, omitting the need for cameras. Our framework first utilizes a segmentation model to extract scene semantics from raw point clouds, and then employs a multi-modal domain translator to generate synthetic image segments and depth cues without real cameras. This yields a dense pseudo point cloud enriched with semantic information. We also introduce a new semantically guided projection method, which enhances detection performance by retaining only relevant pseudo points. We applied our framework to different advanced 3D object detection methods and reported up to 2.9% performance upgrade. We also obtained comparable results on the KITTI 3D object detection dataset, in contrast to other state-of-the-art LiDAR-only detectors.

Via

Access Paper or Ask Questions

FIVA: Facial Image and Video Anonymization and Anonymization Defense

Sep 08, 2023

Felix Rosberg, Eren Erdal Aksoy, Cristofer Englund, Fernando Alonso-Fernandez

Figure 1 for FIVA: Facial Image and Video Anonymization and Anonymization Defense

Figure 2 for FIVA: Facial Image and Video Anonymization and Anonymization Defense

Figure 3 for FIVA: Facial Image and Video Anonymization and Anonymization Defense

Figure 4 for FIVA: Facial Image and Video Anonymization and Anonymization Defense

Abstract:In this paper, we present a new approach for facial anonymization in images and videos, abbreviated as FIVA. Our proposed method is able to maintain the same face anonymization consistently over frames with our suggested identity-tracking and guarantees a strong difference from the original face. FIVA allows for 0 true positives for a false acceptance rate of 0.001. Our work considers the important security issue of reconstruction attacks and investigates adversarial noise, uniform noise, and parameter noise to disrupt reconstruction attacks. In this regard, we apply different defense and protection methods against these privacy threats to demonstrate the scalability of FIVA. On top of this, we also show that reconstruction attack models can be used for detection of deep fakes. Last but not least, we provide experimental results showing how FIVA can even enable face swapping, which is purely trained on a single target image.

* Accepted to ICCVW 2023 - DFAD 2023

Via

Access Paper or Ask Questions

Multimodal Detection and Identification of Robot Manipulation Failures

May 08, 2023

Arda Inceoglu, Eren Erdal Aksoy, Sanem Sariel

Figure 1 for Multimodal Detection and Identification of Robot Manipulation Failures

Figure 2 for Multimodal Detection and Identification of Robot Manipulation Failures

Figure 3 for Multimodal Detection and Identification of Robot Manipulation Failures

Figure 4 for Multimodal Detection and Identification of Robot Manipulation Failures

Abstract:An autonomous service robot should be able to interact with its environment safely and robustly without requiring human assistance. Unstructured environments are challenging for robots since the exact prediction of outcomes is not always possible. Even when the robot behaviors are well-designed, the unpredictable nature of physical robot-object interaction may prevent success in object manipulation. Therefore, execution of a manipulation action may result in an undesirable outcome involving accidents or damages to the objects or environment. Situation awareness becomes important in such cases to enable the robot to (i) maintain the integrity of both itself and the environment, (ii) recover from failed tasks in the short term, and (iii) learn to avoid failures in the long term. For this purpose, robot executions should be continuously monitored, and failures should be detected and classified appropriately. In this work, we focus on detecting and classifying both manipulation and post-manipulation phase failures using the same exteroception setup. We cover a diverse set of failure types for primary tabletop manipulation actions. In order to detect these failures, we propose FINO-Net [1], a deep multimodal sensor fusion based classifier network. Proposed network accurately detects and classifies failures from raw sensory data without any prior knowledge. In this work, we use our extended FAILURE dataset [1] with 99 new multimodal manipulation recordings and annotate them with their corresponding failure types. FINO-Net achieves 0.87 failure detection and 0.80 failure classification F1 scores. Experimental results show that proposed architecture is also appropriate for real-time use.

* arXiv admin note: text overlap with arXiv:2011.05817

Via

Access Paper or Ask Questions

Learning Failure Prevention Skills for Safe Robot Manipulation

May 04, 2023

Abdullah Cihan Ak, Eren Erdal Aksoy, Sanem Sariel

Figure 1 for Learning Failure Prevention Skills for Safe Robot Manipulation

Figure 2 for Learning Failure Prevention Skills for Safe Robot Manipulation

Figure 3 for Learning Failure Prevention Skills for Safe Robot Manipulation

Figure 4 for Learning Failure Prevention Skills for Safe Robot Manipulation

Abstract:Robots are more capable of achieving manipulation tasks for everyday activities than before. But the safety of manipulation skills that robots employ is still an open problem. Considering all possible failures during skill learning increases the complexity of the process and restrains learning an optimal policy. Beyond that, in unstructured environments, it is not easy to enumerate all possible failures beforehand. In the context of safe skill manipulation, we reformulate skills as base and failure prevention skills where base skills aim at completing tasks and failure prevention skills focus on reducing the risk of failures to occur. Then, we propose a modular and hierarchical method for safe robot manipulation by augmenting base skills by learning failure prevention skills with reinforcement learning, forming a skill library to address different safety risks. Furthermore, a skill selection policy that considers estimated risks is used for the robot to select the best control policy for safe manipulation. Our experiments show that the proposed method achieves the given goal while ensuring safety by preventing failures. We also show that with the proposed method, skill learning is feasible, novel failures are easily adaptable, and our safe manipulation tools can be transferred to the real environment.

* 8 pages with 8 figures. For additional materials, see https://air.cs.itu.edu.tr/projects/tubitak-119e436.html

Via

Access Paper or Ask Questions

Depth- and Semantics-aware Multi-modal Domain Translation: Generating 3D Panoramic Color Images from LiDAR Point Clouds

Feb 15, 2023

Tiago Cortinhal, Eren Erdal Aksoy

Figure 1 for Depth- and Semantics-aware Multi-modal Domain Translation: Generating 3D Panoramic Color Images from LiDAR Point Clouds

Figure 2 for Depth- and Semantics-aware Multi-modal Domain Translation: Generating 3D Panoramic Color Images from LiDAR Point Clouds

Figure 3 for Depth- and Semantics-aware Multi-modal Domain Translation: Generating 3D Panoramic Color Images from LiDAR Point Clouds

Figure 4 for Depth- and Semantics-aware Multi-modal Domain Translation: Generating 3D Panoramic Color Images from LiDAR Point Clouds

Abstract:This work presents a new depth- and semantics-aware conditional generative model, named TITAN-Next, for cross-domain image-to-image translation in a multi-modal setup between LiDAR and camera sensors. The proposed model leverages scene semantics as a mid-level representation and is able to translate raw LiDAR point clouds to RGB-D camera images by solely relying on semantic scene segments. We claim that this is the first framework of its kind and it has practical applications in autonomous vehicles such as providing a fail-safe mechanism and augmenting available data in the target image domain. The proposed model is evaluated on the large-scale and challenging Semantic-KITTI dataset, and experimental findings show that it considerably outperforms the original TITAN-Net and other strong baselines by 23.7$\%$ margin in terms of IoU.

Via

Access Paper or Ask Questions

FaceDancer: Pose- and Occlusion-Aware High Fidelity Face Swapping

Oct 19, 2022

Felix Rosberg, Eren Erdal Aksoy, Fernando Alonso-Fernandez, Cristofer Englund

Figure 1 for FaceDancer: Pose- and Occlusion-Aware High Fidelity Face Swapping

Figure 2 for FaceDancer: Pose- and Occlusion-Aware High Fidelity Face Swapping

Figure 3 for FaceDancer: Pose- and Occlusion-Aware High Fidelity Face Swapping

Figure 4 for FaceDancer: Pose- and Occlusion-Aware High Fidelity Face Swapping

Abstract:In this work, we present a new single-stage method for subject agnostic face swapping and identity transfer, named FaceDancer. We have two major contributions: Adaptive Feature Fusion Attention (AFFA) and Interpreted Feature Similarity Regularization (IFSR). The AFFA module is embedded in the decoder and adaptively learns to fuse attribute features and features conditioned on identity information without requiring any additional facial segmentation process. In IFSR, we leverage the intermediate features in an identity encoder to preserve important attributes such as head pose, facial expression, lighting, and occlusion in the target face, while still transferring the identity of the source face with high fidelity. We conduct extensive quantitative and qualitative experiments on various datasets and show that the proposed FaceDancer outperforms other state-of-the-art networks in terms of identityn transfer, while having significantly better pose preservation than most of the previous methods.

Via

Access Paper or Ask Questions

Semantic State Estimation in Cloth Manipulation Tasks

Mar 22, 2022

Georgies Tzelepis, Eren Erdal Aksoy, Júlia Borràs, Guillem Alenyà

Figure 1 for Semantic State Estimation in Cloth Manipulation Tasks

Figure 2 for Semantic State Estimation in Cloth Manipulation Tasks

Figure 3 for Semantic State Estimation in Cloth Manipulation Tasks

Figure 4 for Semantic State Estimation in Cloth Manipulation Tasks

Abstract:Understanding of deformable object manipulations such as textiles is a challenge due to the complexity and high dimensionality of the problem. Particularly, the lack of a generic representation of semantic states (e.g., \textit{crumpled}, \textit{diagonally folded}) during a continuous manipulation process introduces an obstacle to identify the manipulation type. In this paper, we aim to solve the problem of semantic state estimation in cloth manipulation tasks. For this purpose, we introduce a new large-scale fully-annotated RGB image dataset showing various human demonstrations of different complicated cloth manipulations. We provide a set of baseline deep networks and benchmark them on the problem of semantic state estimation using our proposed dataset. Furthermore, we investigate the scalability of our semantic state estimation framework in robot monitoring tasks of long and complex cloth manipulations.

Via

Access Paper or Ask Questions

A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction

Oct 25, 2021

Gamze Akyol, Sanem Sariel, Eren Erdal Aksoy

Figure 1 for A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction

Figure 2 for A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction

Figure 3 for A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction

Figure 4 for A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction

Abstract:Despite decades of research, understanding human manipulation activities is, and has always been, one of the most attractive and challenging research topics in computer vision and robotics. Recognition and prediction of observed human manipulation actions have their roots in the applications related to, for instance, human-robot interaction and robot learning from demonstration. The current research trend heavily relies on advanced convolutional neural networks to process the structured Euclidean data, such as RGB camera images. These networks, however, come with immense computational complexity to be able to process high dimensional raw data. Different from the related works, we here introduce a deep graph autoencoder to jointly learn recognition and prediction of manipulation tasks from symbolic scene graphs, instead of relying on the structured Euclidean data. Our network has a variational autoencoder structure with two branches: one for identifying the input graph type and one for predicting the future graphs. The input of the proposed network is a set of semantic graphs which store the spatial relations between subjects and objects in the scene. The network output is a label set representing the detected and predicted class types. We benchmark our new model against different state-of-the-art methods on two different datasets, MANIAC and MSRC-9, and show that our proposed model can achieve better performance. We also release our source code https://github.com/gamzeakyol/GNet.

* Accepted for publication in the Proceedings of 2021 20th International Conference on Advanced Robotics (ICAR)

Via

Access Paper or Ask Questions