Sanaz Seyedin

BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

Sep 17, 2024

S. Rohollah Hosseyni, Ali Ahmad Rahmani, S. Jamal Seyedmohammadi, Sanaz Seyedin, Arash Mohammadi

Abstract: Autoregressive models excel in modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available on https://github.com/RohollahHS/BAD.
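The idea of "causal dependencies through randomized ordering" can be illustrated with a permutation-ordered attention mask. The sketch below is only an illustration of that general idea, not the authors' implementation (see their repository for that); the function name `permutation_causal_mask` and the choice to let each position attend to itself are assumptions made for this example.

```python
import random


def permutation_causal_mask(seq_len, rng):
    """Build an attention mask from a random generation order.

    Hypothetical helper for illustration only.
    Returns (order, mask) where order[k] is the position generated at
    step k, and mask[i][j] is True when position i may attend to
    position j (j comes no later than i in the random order).
    """
    order = list(range(seq_len))
    rng.shuffle(order)  # random generation order over positions

    # rank[pos] = step at which position `pos` is generated
    rank = [0] * seq_len
    for step, pos in enumerate(order):
        rank[pos] = step

    # Causal in the permuted order, but tokens stay at their natural positions,
    # so a position can see context on both its left and its right.
    mask = [[rank[j] <= rank[i] for j in range(seq_len)] for i in range(seq_len)]
    return order, mask


order, mask = permutation_causal_mask(6, random.Random(0))
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row shows which positions one token may look at. Because the order is random rather than left-to-right, the visible context for a given position can include tokens on either side, while every prediction still depends only on tokens generated earlier in the sampled order.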

Human Action Recognition in Still Images Using ConViT

Jul 18, 2023

Seyed Rohollah Hosseyni, Hasan Taheri, Sanaz Seyedin, Ali Ahmad Rahmani

Abstract: Understanding the relationship between different parts of the image plays a crucial role in many visual recognition tasks. Despite the fact that Convolutional Neural Networks (CNNs) have demonstrated impressive results in detecting single objects, they lack the capability to extract the relationship between various regions of an image, which is a crucial factor in human action recognition. To address this problem, this paper proposes a new module that functions like a convolutional layer using Vision Transformer (ViT). The proposed action recognition model comprises two components: the first part is a deep convolutional network that extracts high-level spatial features from the image, and the second component of the model utilizes a Vision Transformer that extracts the relationship between various regions of the image using the feature map generated by the CNN output. The proposed model has been evaluated on the Stanford40 and PASCAL VOC 2012 action datasets and has achieved 95.5% mAP and 91.5% mAP results, respectively, which are promising compared to other state-of-the-art methods.
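The two-stage pipeline described in the abstract (a CNN feature map fed to a transformer that relates image regions) can be sketched roughly as follows. This is a minimal, dependency-free illustration under stated assumptions, not the paper's model: the helper names `flatten_feature_map` and `self_attention` are hypothetical, the feature map is a toy nested list standing in for real CNN output, and the attention uses identity query/key/value projections with a single head instead of learned weights.

```python
import math


def flatten_feature_map(fmap):
    """Turn a C x H x W feature map (nested lists) into H*W tokens of length C.

    Each token corresponds to one spatial region of the image.
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over region tokens.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), to keep the sketch short.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output token is a weighted mix of all region tokens.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Toy 2-channel, 2x2 "feature map" standing in for CNN output.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
tokens = flatten_feature_map(feature_map)
mixed = self_attention(tokens)
print(len(tokens), len(mixed[0]))
```

The point of the sketch is the data flow: every spatial location of the feature map becomes one token, and attention lets each region's representation draw on every other region, which is the kind of cross-region relationship the abstract says plain CNNs struggle to capture.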
