Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
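A minimal sketch of the kind of extract-then-postprocess pipeline this abstract describes; the `Block` class and the rules in `postprocess` are invented for illustration and are not MinerU's or PDF-Extract-Kit's actual API.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """One region produced by layout detection, filled in by OCR or formula recognition."""
    category: str   # e.g. "text", "title", "formula", "header", "footer", "page_number"
    bbox: tuple     # (x0, y0, x1, y1) in page coordinates
    content: str

def postprocess(blocks):
    """Rule-based cleanup of the kind the abstract mentions: drop page furniture,
    then emit the remaining blocks in reading order (top-to-bottom, left-to-right)."""
    body = [b for b in blocks if b.category not in {"header", "footer", "page_number"}]
    body.sort(key=lambda b: (round(b.bbox[1]), b.bbox[0]))
    return "\n\n".join(b.content for b in body)

page = [
    Block("header", (50, 20, 550, 40), "Journal of Examples, 2024"),
    Block("title", (50, 60, 550, 100), "A Sample Document"),
    Block("text", (50, 110, 550, 400), "Body paragraph ..."),
]
print(postprocess(page))  # keeps the title and body text, drops the running header
```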
Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, in long-context modeling, and in open-ended subjective evaluations, through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in the pre-training and fine-tuning stages, and exhibits remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models at different training stages and model sizes, we provide the community with insights into the model's evolution.
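As a rough illustration of the "Needle-in-a-Haystack" evaluation this abstract mentions, the sketch below assembles one test prompt by hiding a fact at a chosen depth in filler text; the function name, filler sentences, and passphrase are assumptions for the example, not the paper's actual protocol.

```python
import random

def build_needle_test(filler_sentences, needle, target_chars, depth):
    """Build a long context with one 'needle' fact inserted at a relative depth,
    plus the question the model must answer from that context."""
    haystack = []
    while sum(len(s) for s in haystack) < target_chars:
        haystack.append(random.choice(filler_sentences))
    haystack.insert(int(depth * len(haystack)), needle)
    context = " ".join(haystack)
    question = "What is the secret passphrase mentioned in the text?"
    return context, question

filler = ["The sky was clear over the valley.", "Traders moved through the old market."]
needle = "The secret passphrase is 'blue falcon'."
context, question = build_needle_test(filler, needle, target_chars=2_000, depth=0.5)
```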
Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.
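For illustration, a toy sketch of two of the pipeline stages named above, heuristic rule filtering and fuzzy deduplication; the thresholds and the MinHash construction are generic assumptions, not the rules actually used to build WanJuan-CC.

```python
import hashlib

def heuristic_filter(doc: str) -> bool:
    """Keep documents that are long enough and mostly alphabetic (illustrative rules only)."""
    words = doc.split()
    if len(words) < 50:
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    return alpha_ratio > 0.8

def minhash_signature(doc: str, num_hashes: int = 64, shingle: int = 5) -> tuple:
    """Toy MinHash signature over word 5-grams; near-duplicate documents share most
    signature slots, which is the basis for fuzzy deduplication at scale."""
    words = doc.lower().split()
    grams = {" ".join(words[i:i + shingle]) for i in range(len(words) - shingle + 1)}
    if not grams:
        grams = {doc.lower()}
    return tuple(
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(num_hashes)
    )
```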
Abstract: Objective: Strain elastography and shear wave elastography are two commonly used methods to quantify cervical elasticity; however, they have limitations. Strain elastography is effective in showing the tissue elasticity distribution in a single image, but the absence of stress information makes it difficult to compare results acquired across different imaging sessions. Shear wave elastography is effective in measuring shear wave speed (an intrinsic tissue property correlated with elasticity) in relatively homogeneous tissue, such as the liver. However, for inhomogeneous tissue in the cervix, the shear wave speed measurement is less robust. To overcome these limitations, we develop a quantitative cervical elastography system by adding a stress sensor to an ultrasound imaging system. Methods: In an imaging session for quantitative cervical elastography, we use the transvaginal ultrasound imaging system to record B-mode images of the cervix showing its deformation and use the stress sensor to simultaneously record the probe-surface stress. We develop a correlation-based automatic feature tracking algorithm to quantify the deformation, from which the strain is quantified. After each imaging session, we calibrate the stress sensor and transform its measurement to true stress. Applying a linear regression to the stress and strain, we obtain an approximation of the cervical Young's modulus. Results: We validate the accuracy and robustness of this elastography system using phantom experiments. Applying this system to pregnant participants, we observe significant softening of the cervix during pregnancy (p-value < 0.001), with the cervical Young's modulus decreasing by 3.95% per week. We estimate that the geometric mean values of the cervical Young's modulus during the first (11 to 13 weeks), second, and third trimesters are 13.07 kPa, 7.59 kPa, and 4.40 kPa, respectively.
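A minimal sketch of the final regression step described above, assuming paired strain and stress measurements are already available from feature tracking and the calibrated sensor; the numbers are invented examples, not measurements from the study.

```python
import numpy as np

# Made-up example measurements illustrating the stress-strain fit.
strain = np.array([0.00, 0.01, 0.02, 0.03, 0.04])   # dimensionless deformation from feature tracking
stress = np.array([0.00, 0.08, 0.15, 0.24, 0.31])   # kPa, from the calibrated probe-surface sensor

slope, intercept = np.polyfit(strain, stress, 1)     # linear fit: stress ~ E * strain + c
print(f"Approximate Young's modulus: {slope:.2f} kPa")
```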