Nayan Saxena

Bridging the Data Provenance Gap Across Text, Speech and Video

Dec 19, 2024

Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena (+33 more)

Abstract: Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.

* 10 pages, 5 figures (main paper)

Consent in Crisis: The Rapid Decline of the AI Data Commons

Jul 24, 2024

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter (+39 more)

Abstract: General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.

* 41 pages (13 main), 5 figures, 9 tables
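
The abstract above mentions differing robots.txt restrictions across AI developers. As a rough illustration only (not the paper's audit pipeline), the sketch below uses Python's standard `urllib.robotparser` to check which crawlers a robots.txt file admits. The robots.txt content, the `allowed_agents` helper, and the example URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely,
# and only hides /private/ from every other agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: may_fetch} for `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "Googlebot"],
    "https://example.com/articles/1",
)
print(result)
```

In this made-up file only the agent named explicitly is refused; the other two fall through to the wildcard rule, which is one simple way restrictions can differ by developer.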

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

Feb 28, 2024

Ethan Smith, Nayan Saxena, Aninda Saha

Abstract: Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method, ToDo, that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.
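
The idea of downsampling only the key and value tokens can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are assumed to lie on a row-major height x width grid, and downsampling is assumed to be simple strided subsampling (the abstract does not specify the operator). All function names here are hypothetical.

```python
import math
import random


def downsample_tokens(tokens, height, width, factor):
    """Keep every `factor`-th token along both axes of a row-major grid."""
    return [
        tokens[row * width + col]
        for row in range(0, height, factor)
        for col in range(0, width, factor)
    ]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))
        ])
    return outputs


def attention_with_kv_downsampling(queries, keys, values, height, width, factor=2):
    """Queries stay at full resolution; only keys and values are subsampled."""
    small_keys = downsample_tokens(keys, height, width, factor)
    small_values = downsample_tokens(values, height, width, factor)
    return attention(queries, small_keys, small_values)


random.seed(0)
HEIGHT, WIDTH, DIM = 8, 8, 4


def random_tokens():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(HEIGHT * WIDTH)]


q, k, v = random_tokens(), random_tokens(), random_tokens()
out = attention_with_kv_downsampling(q, k, v, HEIGHT, WIDTH, factor=2)
print(len(out), len(out[0]))
```

With a factor of 2 on each axis, every query attends to a quarter as many keys, so the score matrix shrinks from N x N to N x N/4 while the output keeps one vector per query.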

Towards One Shot Search Space Poisoning in Neural Architecture Search

Nov 13, 2021

Nayan Saxena, Robert Wu, Rohan Jain

Abstract: We evaluate the robustness of a Neural Architecture Search (NAS) algorithm known as Efficient NAS (ENAS) against data-agnostic poisoning attacks on the original search space with carefully designed ineffective operations. We empirically demonstrate how our one shot search space poisoning approach exploits design flaws in the ENAS controller to degrade predictive performance on classification tasks. With just two poisoning operations injected into the search space, we inflate prediction error rates for child networks up to 90% on the CIFAR-10 dataset.

* (Student Abstract) In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2022. arXiv admin note: substantial text overlap with arXiv:2106.14406

NeuralArTS: Structuring Neural Architecture Search with Type Theory

Nov 05, 2021

Robert Wu, Nayan Saxena, Rohan Jain

Abstract: Neural Architecture Search (NAS) algorithms automate the task of finding optimal deep learning architectures given an initial search space of possible operations. Developing these search spaces is usually a manual affair, with pre-optimized search spaces being more efficient than searching from scratch. In this paper we present a new framework called Neural Architecture Type System (NeuralArTS) that categorizes the infinite set of network operations in a structured type system. We further demonstrate how NeuralArTS can be applied to convolutional layers and propose several future directions.

* (Student Abstract) In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2022

Sign-to-Speech Model for Sign Language Understanding: A Case Study of Nigerian Sign Language

Nov 02, 2021

Steven Kolawole, Opeyemi Osakuade, Nayan Saxena, Babatunde Kazeem Olorisade

Abstract: Through this paper, we seek to reduce the communication barrier between the hearing-impaired community and the larger society, who are usually not familiar with sign language, in the sub-Saharan region of Africa with the largest occurrences of hearing disability cases, using Nigeria as a case study. The dataset is a pioneer dataset for the Nigerian Sign Language and was created in collaboration with relevant stakeholders. We pre-processed the data in readiness for two different object detection models and a classification model and employed diverse evaluation metrics to gauge model performance on sign-language-to-text conversion tasks. Finally, we convert the predicted sign texts to speech and deploy the best performing model in a lightweight application that works in real-time and achieves impressive results converting sign words/phrases to text and subsequently into speech.

Statistical Consequences of Dueling Bandits

Oct 16, 2021

Nayan Saxena, Pan Chen, Emmy Liu

Abstract: Multi-Armed-Bandit frameworks have often been used by researchers to assess educational interventions; however, recent work has shown that it is more beneficial for a student to provide qualitative feedback through preference elicitation between different alternatives, making a dueling bandits framework more appropriate. In this paper, we explore the statistical quality of data under this framework by comparing traditional uniform sampling to a dueling bandit algorithm and find that dueling bandit algorithms perform well at cumulative regret minimisation, but lead to inflated Type-I error rates and reduced power under certain circumstances. Through these results we provide insight into the challenges and opportunities in using dueling bandit algorithms to run adaptive experiments.

* In Workshop on Reinforcement Learning for Education, 14th International Conference on Educational Data Mining, Paris, France, 2021

Poisoning the Search Space in Neural Architecture Search

Jun 28, 2021

Robert Wu, Nayan Saxena, Rohan Jain

Abstract: Deep learning has proven to be a highly effective problem-solving tool for object detection and image segmentation across various domains such as healthcare and autonomous driving. At the heart of this performance lies neural architecture design, which relies heavily on domain knowledge and prior experience on the researchers' behalf. More recently, this process of finding optimal architectures, given an initial search space of possible operations, was automated by Neural Architecture Search (NAS). In this paper, we evaluate the robustness of one such algorithm known as Efficient NAS (ENAS) against data-agnostic poisoning attacks on the original search space with carefully designed ineffective operations. By evaluating algorithm performance on the CIFAR-10 dataset, we empirically demonstrate how our novel search space poisoning (SSP) approach and multiple-instance poisoning attacks exploit design flaws in the ENAS controller to result in inflated prediction error rates for child networks. Our results provide insights into the challenges to surmount in using NAS for more adversarially robust architecture search.

* All authors contributed equally. Appears in AdvML Workshop @ ICML2021: A Blessing in Disguise: The Prospects and Perils of Adversarial Machine Learning
