Abstract:CLIP has demonstrated exceptional image-text matching capabilities due to its training on contrastive learning tasks. Past research has suggested that whereas CLIP effectively matches text to images when the matching can be achieved just by matching the text with the objects in the image, CLIP struggles when the matching depends on representing the relationship among the objects in the images (i.e., inferring relations). Previous attempts to address this limitation by training CLIP on relation detection datasets with only linguistic supervision have met with limited success. In this paper, we offer insights and practical methods to advance the field of relation inference from images. This paper approaches the task of creating a model that effectively detects relations among the objects in images by producing text and image embeddings that capture relationships through linguistic supervision. To this end, we propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which augments the COCO dataset, fine-tunes CLIP with hard negatives subject-relation-object triples and corresponding images, and introduces a novel loss function to improve relation detection. Evaluated on multiple CLIP-based models, our method significantly improves zero-shot relation inference accuracy in both frozen and fine-tuned settings, significantly outperforming CLIP and state-of-the-art models while generalizing well on unseen data.
Abstract:This study examines gender disparities in patent law by analyzing the textual content of patent applications. While prior research has primarily focused on the study of metadata (i.e., filing year or technological class), we employ machine learning and natural language processing techniques to derive latent information from patent texts. In particular, these methods are used to predict inventor gender based on textual characteristics. We find that gender can be identified with notable accuracy - even without knowing the inventor's name. This ability to discern gender through text suggests that anonymized patent examination - often proposed as a solution to mitigate disparities in patent grant rate - may not fully address gender-specific outcomes in securing a patent. Our analysis additionally identifies gendered differences in textual choices within patent documents and the fields in which inventors choose to work. These findings highlight the complex interaction between textual choices, gender, and success in securing a patent. As discussed herein, this raises critical questions about the efficacy of current proposals aimed at achieving gender parity and efficiency in the patent system.
Abstract:Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition. This technology is error-prone, especially for historical documents. To correct OCR errors, post-processing algorithms have been proposed based on natural language analysis and machine learning techniques such as neural networks. Neural network's disadvantage is the vast amount of manually labeled data required for training, which is often unavailable. This paper proposes an innovative method for training a light-weight neural network for Hebrew OCR post-correction using significantly less manually created data. The main research goal is to develop a method for automatically generating language and task-specific training data to improve the neural network results for OCR post-correction, and to investigate which type of dataset is the most effective for OCR post-correction of historical documents. To this end, a series of experiments using several datasets was conducted. The evaluation corpus was based on Hebrew newspapers from the JPress project. An analysis of historical OCRed newspapers was done to learn common language and corpus-specific OCR errors. We found that training the network using the proposed method is more effective than using randomly generated errors. The results also show that the performance of the neural network for OCR post-correction strongly depends on the genre and area of the training data. Moreover, neural networks that were trained with the proposed method outperform other state-of-the-art neural networks for OCR post-correction and complex spellcheckers. These results may have practical implications for many digital humanities projects.
Abstract:One of the key AI tools for textual corpora exploration is natural language question-answering (QA). Unlike keyword-based search engines, QA algorithms receive and process natural language questions and produce precise answers to these questions, rather than long lists of documents that need to be manually scanned by the users. State-of-the-art QA algorithms based on DNNs were successfully employed in various domains. However, QA in the genealogical domain is still underexplored, while researchers in this field (and other fields in humanities and social sciences) can highly benefit from the ability to ask questions in natural language, receive concrete answers and gain insights hidden within large corpora. While some research has been recently conducted for factual QA in the genealogical domain, to the best of our knowledge, there is no previous research on the more challenging task of numerical aggregation QA (i.e., answering questions combining aggregation functions, e.g., count, average, max). Numerical aggregation QA is critical for distant reading and analysis for researchers (and the general public) interested in investigating cultural heritage domains. Therefore, in this study, we present a new end-to-end methodology for numerical aggregation QA for genealogical trees that includes: 1) an automatic method for training dataset generation; 2) a transformer-based table selection method, and 3) an optimized transformer-based numerical aggregation QA model. The findings indicate that the proposed architecture, GLOBE, outperforms the state-of-the-art models and pipelines by achieving 87% accuracy for this task compared to only 21% by current state-of-the-art models. This study may have practical implications for genealogical information centers and museums, making genealogical data research easy and scalable for experts as well as the general public.
Abstract:With the rising popularity of user-generated genealogical family trees, new genealogical information systems have been developed. State-of-the-art natural question answering algorithms use deep neural network (DNN) architecture based on self-attention networks. However, some of these models use sequence-based inputs and are not suitable to work with graph-based structure, while graph-based DNN models rely on high levels of comprehensiveness of knowledge graphs that is nonexistent in the genealogical domain. Moreover, these supervised DNN models require training datasets that are absent in the genealogical domain. This study proposes an end-to-end approach for question answering using genealogical family trees by: 1) representing genealogical data as knowledge graphs, 2) converting them to texts, 3) combining them with unstructured texts, and 4) training a trans-former-based question answering model. To evaluate the need for a dedicated approach, a comparison between the fine-tuned model (Uncle-BERT) trained on the auto-generated genealogical dataset and state-of-the-art question-answering models was per-formed. The findings indicate that there are significant differences between answering genealogical questions and open-domain questions. Moreover, the proposed methodology reduces complexity while increasing accuracy and may have practical implications for genealogical research and real-world projects, making genealogical data accessible to experts as well as the general public.
Abstract:Over the past few decades, large archives of paper-based historical documents, such as books and newspapers, have been digitized using the Optical Character Recognition (OCR) technology. Unfortunately, this broadly used technology is error-prone, especially when an OCRed document was written hundreds of years ago. Neural networks have shown great success in solving various text processing tasks, including OCR post-correction. The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets they require to learn from, especially for morphologically-rich languages like Hebrew. Moreover, it is not clear what are the optimal structure and values of hyperparameters (predefined parameters) of neural networks for OCR error correction in Hebrew due to its unique features. Furthermore, languages change across genres and periods. These changes may affect the accuracy of OCR post-correction neural network models. To overcome these challenges, we developed a new multi-phase method for generating artificial training datasets with OCR errors and hyperparameters optimization for building an effective neural network for OCR post-correction in Hebrew.
Abstract:Combining computational technologies and humanities is an ongoing effort aimed at making resources such as texts, images, audio, video, and other artifacts digitally available, searchable, and analyzable. In recent years, deep neural networks (DNN) dominate the field of automatic text analysis and natural language processing (NLP), in some cases presenting a super-human performance. DNNs are the state-of-the-art machine learning algorithms solving many NLP tasks that are relevant for Digital Humanities (DH) research, such as spell checking, language detection, entity extraction, author detection, question answering, and other tasks. These supervised algorithms learn patterns from a large number of "right" and "wrong" examples and apply them to new examples. However, using DNNs for analyzing the text resources in DH research presents two main challenges: (un)availability of training data and a need for domain adaptation. This paper explores these challenges by analyzing multiple use-cases of DH studies in recent literature and their possible solutions and lays out a practical decision model for DH experts for when and how to choose the appropriate deep learning approaches for their research. Moreover, in this paper, we aim to raise awareness of the benefits of utilizing deep learning models in the DH community.