Abstract:Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
Abstract:Wood-leaf classification is an essential and fundamental prerequisite in the analysis and estimation of forest attributes from terrestrial laser scanning (TLS) point clouds,including critical measurements such as diameter at breast height(DBH),above-ground biomass(AGB),wood volume.To address this,we introduce the Wood-Leaf Classification Network(WLC-Net),a deep learning model derived from PointNet++,designed to differentiate between wood and leaf points within tree point clouds.WLC-Net enhances classification accuracy,completeness,and speed by incorporating linearity as an inherent feature,refining the input-output framework,and optimizing the centroid sampling technique.WLC-Net was trained and assessed using three distinct tree species datasets,comprising a total of 102 individual tree point clouds:21 Chinese ash trees,21 willow trees,and 60 tropical trees.For comparative evaluation,five alternative methods,including PointNet++,DGCNN,Krishna Moorthy's method,LeWoS, and Sun's method,were also applied to these datasets.The classification accuracy of all six methods was quantified using three metrics:overall accuracy(OA),mean Intersection over Union(mIoU),and F1-score.Across all three datasets,WLC-Net demonstrated superior performance, achieving OA scores of 0.9778, 0.9712, and 0.9508;mIoU scores of 0.9761, 0.9693,and 0.9141;and F1-scores of 0.8628, 0.7938,and 0.9019,respectively.The time costs of WLC-Net were also recorded to evaluate the efficiency.The average processing time was 102.74s per million points for WLC-Net.In terms of visual inspect,accuracy evaluation and efficiency evaluation,the results suggest that WLC-Net presents a promising approach for wood-leaf classification,distinguished by its high accuracy. In addition,WLC-Net also exhibits strong applicability across various tree point clouds and holds promise for further optimization.