Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Carlos Caetano

Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability

Apr 20, 2025

Carlos Caetano, Gabriel O. dos Santos, Caio Petrucci, Artur Barros, Camila Laranjeira, Leo S. F. Ribeiro, Júlia F. de Mendonça, Jefersson A. dos Santos, Sandra Avila

Abstract:Including children's images in datasets has raised ethical concerns, particularly regarding privacy, consent, data protection, and accountability. These datasets, often built by scraping publicly available images from the Internet, can expose children to risks such as exploitation, profiling, and tracking. Despite the growing recognition of these issues, approaches for addressing them remain limited. We explore the ethical implications of using children's images in AI datasets and propose a pipeline to detect and remove such images. As a use case, we built the pipeline on a Vision-Language Model under the Visual Question Answering task and tested it on the #PraCegoVer dataset. We also evaluate the pipeline on a subset of 100,000 images from the Open Images V7 dataset to assess its effectiveness in detecting and removing images of children. The pipeline serves as a baseline for future research, providing a starting point for more comprehensive tools and methodologies. While we leverage existing models trained on potentially problematic data, our goal is to expose and address this issue. We do not advocate for training or deploying such models, but instead call for urgent community reflection and action to protect children's rights. Ultimately, we aim to encourage the research community to exercise - more than an additional - care in creating new datasets and to inspire the development of tools to protect the fundamental rights of vulnerable groups, particularly children.

* ACM Conference on Fairness, Accountability, and Transparency (FAccT 2025)

Via

Access Paper or Ask Questions

Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints

Sep 11, 2019

Carlos Caetano, François Brémond, William Robson Schwartz

Figure 1 for Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints

Figure 2 for Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints

Figure 3 for Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints

Figure 4 for Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints

Abstract:In the last years, the computer vision research community has studied on how to model temporal dynamics in videos to employ 3D human action recognition. To that end, two main baseline approaches have been researched: (i) Recurrent Neural Networks (RNNs) with Long-Short Term Memory (LSTM); and (ii) skeleton image representations used as input to a Convolutional Neural Network (CNN). Although RNN approaches present excellent results, such methods lack the ability to efficiently learn the spatial relations between the skeleton joints. On the other hand, the representations used to feed CNN approaches present the advantage of having the natural ability of learning structural information from 2D arrays (i.e., they learn spatial relations from the skeleton joints). To further improve such representations, we introduce the Tree Structure Reference Joints Image (TSRJI), a novel skeleton image representation to be used as input to CNNs. The proposed representation has the advantage of combining the use of reference joints and a tree structure skeleton. While the former incorporates different spatial relationships between the joints, the latter preserves important spatial relations by traversing a skeleton tree with a depth-first order algorithm. Experimental results demonstrate the effectiveness of the proposed representation for 3D action recognition on two datasets achieving state-of-the-art results on the recent NTU RGB+D~120 dataset.

* Conference on Graphics, Patterns and Images (SIBGRAPI2019). arXiv admin note: substantial text overlap with arXiv:1907.13025

Via

Access Paper or Ask Questions

SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition

Jul 30, 2019

Carlos Caetano, Jessica Sena, François Brémond, Jefersson A. dos Santos, William Robson Schwartz

Figure 1 for SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition

Figure 2 for SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition

Figure 3 for SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition

Figure 4 for SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition

Abstract:Due to the availability of large-scale skeleton datasets, 3D human action recognition has recently called the attention of computer vision community. Many works have focused on encoding skeleton data as skeleton image representations based on spatial structure of the skeleton joints, in which the temporal dynamics of the sequence is encoded as variations in columns and the spatial structure of each frame is represented as rows of a matrix. To further improve such representations, we introduce a novel skeleton image representation to be used as input of Convolutional Neural Networks (CNNs), named SkeleMotion. The proposed approach encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Different temporal scales are employed to compute motion values to aggregate more temporal dynamics to the representation making it able to capture longrange joint interactions involved in actions as well as filtering noisy motion values. Experimental results demonstrate the effectiveness of the proposed representation on 3D action recognition outperforming the state-of-the-art on NTU RGB+D 120 dataset.

* 16-th IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS2019)

Via

Access Paper or Ask Questions

Activity Recognition based on a Magnitude-Orientation Stream Network

Aug 22, 2017

Carlos Caetano, Victor H. C. de Melo, Jefersson A. dos Santos, William Robson Schwartz

Figure 1 for Activity Recognition based on a Magnitude-Orientation Stream Network

Figure 2 for Activity Recognition based on a Magnitude-Orientation Stream Network

Figure 3 for Activity Recognition based on a Magnitude-Orientation Stream Network

Figure 4 for Activity Recognition based on a Magnitude-Orientation Stream Network

Abstract:The temporal component of videos provides an important clue for activity recognition, as a number of activities can be reliably recognized based on the motion information. In view of that, this work proposes a novel temporal stream for two-stream convolutional networks based on images computed from the optical flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to learn the motion in a better and richer manner. Our method applies simple nonlinear transformations on the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Experimental results, carried on two well-known datasets (HMDB51 and UCF101), demonstrate that using our proposed temporal stream as input to existing neural network architectures can improve their performance for activity recognition. Results demonstrate that our temporal stream provides complementary information able to improve the classical two-stream methods, indicating the suitability of our approach to be used as a temporal video representation.

* 8 pages, SIBGRAPI 2017

Via

Access Paper or Ask Questions

A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection

May 12, 2016

Carlos Caetano, Sandra Avila, William Robson Schwartz, Silvio Jamil F. Guimarães, Arnaldo de A. Araújo

Figure 1 for A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection

Figure 2 for A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection

Figure 3 for A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection

Figure 4 for A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection

Abstract:With the growing amount of inappropriate content on the Internet, such as pornography, arises the need to detect and filter such material. The reason for this is given by the fact that such content is often prohibited in certain environments (e.g., schools and workplaces) or for certain publics (e.g., children). In recent years, many works have been mainly focused on detecting pornographic images and videos based on visual content, particularly on the detection of skin color. Although these approaches provide good results, they generally have the disadvantage of a high false positive rate since not all images with large areas of skin exposure are necessarily pornographic images, such as people wearing swimsuits or images related to sports. Local feature based approaches with Bag-of-Words models (BoW) have been successfully applied to visual recognition tasks in the context of pornography detection. Even though existing methods provide promising results, they use local feature descriptors that require a high computational processing time yielding high-dimensional vectors. In this work, we propose an approach for pornography detection based on local binary feature extraction and BossaNova image representation, a BoW model extension that preserves more richly the visual information. Moreover, we propose two approaches for video description based on the combination of mid-level representations namely BossaNova Video Descriptor (BNVD) and BoW Video Descriptor (BoW-VD). The proposed techniques are promising, achieving an accuracy of 92.40%, thus reducing the classification error by 16% over the current state-of-the-art local features approach on the Pornography dataset.

* Manuscript accepted at Elsevier Neurocomputing

Via

Access Paper or Ask Questions