Mengjiao Zhang

Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity

Oct 21, 2024

Mengjiao Zhang, Jia Xu

Abstract: While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may still recover private training data by exploiting model parameters and gradients. Protecting against such embedding attacks therefore remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB), which encodes subwords as byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires less memory, using a vocabulary of only $256$ bytes, while preserving efficiency by keeping the input length the same. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show that SEB effectively prevents embedding-based attacks from recovering original sentences in federated learning. We also verify that SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling, with lower time and space complexity.
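The two properties the abstract highlights, a 256-entry byte vocabulary and an unchanged input length, can be illustrated with a minimal sketch. This is not the paper's architecture: the UTF-8 byte mapping, the mean pooling (a stand-in for the deep neural network the abstract mentions), and the names `EMBED_DIM`, `make_byte_table`, `embed_subword`, and `embed_sequence` are all assumptions made for illustration.

```python
import random

NUM_BYTES = 256  # byte vocabulary size stated in the abstract
EMBED_DIM = 8    # hypothetical embedding width, chosen only for this sketch


def make_byte_table(seed=0):
    """Build a 256 x EMBED_DIM table of random byte embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
            for _ in range(NUM_BYTES)]


def embed_subword(subword, table):
    """Return one vector per subword by pooling its byte embeddings.

    Assumption: UTF-8 bytes and mean pooling stand in for the paper's
    subword-to-byte mapping and neural aggregation, which the abstract
    does not specify.
    """
    byte_ids = list(subword.encode("utf-8"))
    if not byte_ids:
        raise ValueError("subword must be non-empty")
    vectors = [table[b] for b in byte_ids]
    # Average each embedding dimension across the subword's bytes.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed_sequence(subwords, table):
    """Embed a subword sequence; output length equals input length."""
    return [embed_subword(s, table) for s in subwords]


table = make_byte_table()
subwords = ["priv", "##acy", "matters"]
embedded = embed_sequence(subwords, table)
```

In this sketch the embedding table has 256 rows regardless of how many distinct subwords exist, and the model still sees one vector per subword, so the sequence length is the same as with a conventional subword embedding.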

Retrieval Augmented Generation for Domain-specific Question Answering

Apr 23, 2024

Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, Varun Kotte

Abstract: Question answering (QA) has become an important application in the advanced development of large language models. General pre-trained large language models for question answering are not trained to properly understand the knowledge or terminology of a specific domain, such as finance, healthcare, education, or customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop an approach for retrieval-aware fine-tuning of a large language model. We show that fine-tuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping the latest retrieved information in context for grounding.

* AAAI 2024 (Association for the Advancement of Artificial Intelligence) Scientific Document Understanding Workshop
