Shi Ye

An Empirical Study on Transfer Learning for Privilege Review

Dec 16, 2021

Haozhen Zhao, Shi Ye, Jingchao Yang

Abstract: Protecting privileged communications and data from inadvertent disclosure is a paramount task in US legal practice. Traditionally, counsel rely on keyword searching and manual review to identify privileged documents in cases. As data volumes increase, this approach becomes less and less defensible in terms of cost. Machine learning methods have been used to identify privileged documents. Given the generalizable nature of privilege in legal cases, we hypothesize that transfer learning can capitalize on knowledge learned from existing labeled data to identify privileged documents without requiring new labeled training data. In this paper, we study both traditional machine learning models and deep learning models based on BERT for privileged document classification tasks in legal document review, and we examine the effectiveness of transfer learning for privilege models on three real-world datasets with privilege labels. Our results show that the BERT model outperforms the industry-standard logistic regression algorithm and that transfer learning models can achieve decent performance on datasets in the same or closely related domains.

* 2021 IEEE International Conference on Big Data (Big Data)

Application of Deep Learning in Recognizing Bates Numbers and Confidentiality Stamping from Images

Feb 05, 2021

Christian J. Mahoney, Katie Jensen, Fusheng Wei, Haozhen Zhao, Han Qin, Shi Ye

Abstract: In eDiscovery, it is critical to ensure that each page produced in legal proceedings conforms with the requirements of court or government agency production requests. Errors in productions could have severe consequences in a case, putting a party in an adverse position. The volume of pages produced continues to increase, and tremendous time and effort have been spent to ensure quality control of document productions. This has historically been a manual and laborious process. This paper demonstrates a novel automated production quality control application that leverages deep learning-based image recognition technology to extract Bates Numbers and Confidentiality Stamping from legal case production images and validate their correctness. The effectiveness of the method is verified with an experiment using real-world production data.

* 2020 IEEE International Conference on Big Data (Big Data)

Image Analytics for Legal Document Review: A Transfer Learning Approach

Dec 19, 2019

Nathaniel Huber-Fliflet, Fusheng Wei, Haozhen Zhao, Han Qin, Shi Ye, Amy Tsang

Abstract: Though technology-assisted review in electronic discovery has focused on text data, the need for advanced analytics to facilitate the review of multimedia content is on the rise. In this paper, we present several applications of deep learning in computer vision to Technology Assisted Review of image data in the legal industry. These applications include image classification, image clustering, and object detection. We use transfer learning techniques to leverage established pretrained models for feature extraction and fine-tuning. These applications are the first of their kind in the legal industry for image document review. We demonstrate the effectiveness of these applications in solving real-world business challenges.

* 2019 IEEE International Conference on Big Data (Big Data)

Empirical Comparisons of CNN with Other Learning Algorithms for Text Classification in Legal Document Review

Dec 19, 2019

Robert Keeling, Rishi Chhatwal, Nathaniel Huber-Fliflet, Jianping Zhang, Fusheng Wei, Haozhen Zhao, Shi Ye, Han Qin

Abstract: Research has shown that Convolutional Neural Networks (CNN) can be effectively applied to text classification as part of a predictive coding protocol. That said, most research to date has been conducted on data sets with short documents that do not reflect the variety of documents in real-world document reviews. Using data from four actual reviews with documents of varying lengths, we compared CNN with other popular machine learning algorithms for text classification, including Logistic Regression, Support Vector Machine, and Random Forest. For each data set, classification models were trained with different training sample sizes using different learning algorithms. These models were then evaluated using a large randomly sampled test set of documents, and the results were compared using precision and recall curves. Our study demonstrates that CNN performed well, but that there was no single algorithm that performed the best across the combination of data sets and training sample sizes. These results will help advance research into the legal profession's use of machine learning algorithms that maximize performance.

* 2019 IEEE International Conference on Big Data (Big Data)
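
The abstract above compares models using precision and recall curves. As a rough illustration of how such a curve can be derived from a model's ranked output, here is a minimal sketch. The function name `precision_recall_points` and the toy scores and labels are hypothetical and are not taken from the paper.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs at every cutoff of a ranked list.

    scores: model scores, higher means more likely relevant (assumed).
    labels: 1 for a relevant document, 0 otherwise.
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("labels contain no relevant documents")

    # Rank documents from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = []
    found = 0
    for cutoff, (_, label) in enumerate(ranked, start=1):
        found += label
        recall = found / total_relevant
        precision = found / cutoff
        points.append((recall, precision))
    return points


# Toy example with four documents, two of which are relevant.
example_points = precision_recall_points([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0])
for recall, precision in example_points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting one such curve per algorithm and training sample size on the same test set gives the kind of side-by-side comparison the abstract describes.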

Evaluation of Seed Set Selection Approaches and Active Learning Strategies in Predictive Coding

Jun 11, 2019

Christian J. Mahoney, Nathaniel Huber-Fliflet, Haozhen Zhao, Jianping Zhang, Peter Gronvall, Shi Ye

Abstract: Active learning is a popular methodology in text classification - known in the legal domain as "predictive coding" or "Technology Assisted Review" or "TAR" - due to its potential to minimize the review effort required to build effective classifiers. In this study, we use extensive experimentation to examine the impact of popular seed set selection strategies in active learning, within a predictive coding exercise, and evaluate different active learning strategies against well-researched continuous active learning strategies for the purpose of determining efficient training methods for classifying large populations quickly and precisely. We study how random sampling, keyword models, and clustering-based seed set selection strategies, combined with top-ranked, uncertain, random, recall-inspired, and hybrid active learning document selection strategies, affect the performance of active learning for predictive coding. We use the percentage of documents requiring review to reach 75% recall as the "benchmark" metric to evaluate and compare our approaches. In most cases we find that seed set selection methods have a minor impact, though they do show significant impact in lower-richness data sets or when choosing a top-ranked active learning selection strategy. Our results also show that active learning selection strategies based on uncertainty, random, or 75% recall selection have the potential to reach the optimum active learning round much earlier than the popular continuous active learning approach (top-ranked selection). The results of our research shed light on the impact of active learning seed set selection strategies and also on the effectiveness of the selection strategies for the following learning rounds. Legal practitioners can use the results of this study to enhance the efficiency, precision, and simplicity of their predictive coding process.

* 1st International Workshop on AI and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA) at The 17th International Conference on Artificial Intelligence and Law (ICAIL 2019)
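
The abstract above uses the percentage of documents requiring review to reach 75% recall as its benchmark metric. The sketch below shows one plausible way to compute that quantity from a ranked document list. The function name `review_depth_at_recall` and the toy data are hypothetical, and the sketch assumes documents are reviewed strictly in descending score order, which may differ from the paper's exact protocol.

```python
def review_depth_at_recall(scores, labels, target_recall=0.75):
    """Fraction of the population reviewed when target recall is first reached.

    scores: model scores, higher means reviewed earlier (assumed).
    labels: 1 for a relevant document, 0 otherwise.
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("labels contain no relevant documents")

    # Review order: highest-scoring documents first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    found = 0
    for reviewed, (_, label) in enumerate(ranked, start=1):
        found += label
        if found / total_relevant >= target_recall:
            return reviewed / len(ranked)
    # Unreachable for target_recall <= 1.0, kept as a defensive fallback.
    return 1.0


# Toy population of eight documents, four of which are relevant.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 1, 0, 0, 1]
depth = review_depth_at_recall(scores, labels)
print(f"{depth:.1%} of documents reviewed to reach 75% recall")
```

A lower value means less review effort for the same recall, so two seed set or selection strategies can be compared by computing this number for each on the same population.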

Empirical Study of Deep Learning for Text Classification in Legal Document Review

Apr 03, 2019

Fusheng Wei, Han Qin, Shi Ye, Haozhen Zhao

Abstract: Predictive coding has been widely used in legal matters to find relevant or privileged documents in large sets of electronically stored information. It saves significant time and cost. Logistic Regression (LR) and Support Vector Machines (SVM) are two popular machine learning algorithms used in predictive coding. Recently, deep learning has received a lot of attention in many industries. This paper reports our preliminary studies in using deep learning in legal document review. Specifically, we conducted experiments to compare deep learning results with results obtained using an SVM algorithm on four datasets from real legal matters. Our results showed that CNN performed better with larger volumes of training data and should be a suitable method for text classification in the legal industry.

* 2018 IEEE International Conference on Big Data (Big Data)

Empirical Evaluations of Seed Set Selection Strategies for Predictive Coding

Mar 21, 2019

Christian J. Mahoney, Nathaniel Huber-Fliflet, Katie Jensen, Haozhen Zhao, Robert Neary, Shi Ye

Abstract: Training documents have a significant impact on the performance of predictive models in the legal domain. Yet, there is limited research that explores the effectiveness of the training document selection strategy - in particular, the strategy used to select the seed set, or the set of documents an attorney reviews first to establish an initial model. Since there is limited research on this important component of predictive coding, the authors of this paper set out to identify strategies that consistently perform well. Our research demonstrated that the seed set selection strategy can have a significant impact on the precision of a predictive model. Equipping attorneys with the results of this study will allow them to initiate the most effective predictive modeling process to comb through the terabytes of data typically present in modern litigation. This study used documents from four actual legal cases to evaluate eight different seed set selection strategies. Attorneys can use the results contained within this paper to enhance their approach to predictive coding.

* 2018 IEEE International Conference on Big Data
