Abstract:The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. Astronomers are turning to deep learning techniques to address this, but the methods are limited by their specific training sets, leading to considerable duplicate workloads too. Hence, as an example to present how to overcome the issue, we built a framework for general analysis of galaxy images, based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratio of galaxy images and the imbalanced distribution of galaxy categories, we have incorporated a Human-in-the-loop (HITL) module into our large vision model, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability to all the abovementioned tasks on galaxy images in the DESI legacy imaging surveys. Expressly, for object detection, trained by 1000 data points, our DST upon the LVM achieves an accuracy of 96.7%, while ResNet50 plus Mask R-CNN gives an accuracy of 93.1%; for morphology classification, to obtain AUC ~0.9, LVM plus DST and HITL only requests 1/50 training sets compared to ResNet18. Expectedly, multimodal data can be integrated similarly, which opens up possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-message astronomy.
Abstract:The Segment Anything Model (SAM) is a powerful foundation model that introduced revolutionary advancements in natural image segmentation. However, its performance remains sub-optimal when delineating the intricate structure of biomedical images, where multiple organs and tissues intertwine in a single image. In this study, we introduce a novel fine-tuning framework that leverages SAM's ability to bundle and process multiple prompts per image and seeks to improve SAM's performance in medical images. We first curated a medical image dataset that consists of CT scans of lesions in various organs, each with two annotations for organs and lesions respectively. Then, we fine-tuned SAM's mask decoder within our framework by batching both bounding boxes generated from ground truth masks as reference. The batched prompt strategy we introduced not only addresses the inherent complexity and ambiguity often found in medical images but also substantially enhances performance metrics when applied onto a wide range of segmentation tasks.
Abstract:With the development of a series of Galaxy sky surveys in recent years, the observations increased rapidly, which makes the research of machine learning methods for galaxy image recognition a hot topic. Available automatic galaxy image recognition researches are plagued by the large differences in similarity between categories, the imbalance of data between different classes, and the discrepancy between the discrete representation of Galaxy classes and the essentially gradual changes from one morphological class to the adjacent class (DDRGC). These limitations have motivated several astronomers and machine learning experts to design projects with improved galaxy image recognition capabilities. Therefore, this paper proposes a novel learning method, ``Hierarchical Imbalanced data learning with Weighted sampling and Label smoothing" (HIWL). The HIWL consists of three key techniques respectively dealing with the above-mentioned three problems: (1) Designed a hierarchical galaxy classification model based on an efficient backbone network; (2) Utilized a weighted sampling scheme to deal with the imbalance problem; (3) Adopted a label smoothing technique to alleviate the DDRGC problem. We applied this method to galaxy photometric images from the Galaxy Zoo-The Galaxy Challenge, exploring the recognition of completely round smooth, in between smooth, cigar-shaped, edge-on and spiral. The overall classification accuracy is 96.32\%, and some superiorities of the HIWL are shown based on recall, precision, and F1-Score in comparing with some related works. In addition, we also explored the visualization of the galaxy image features and model attention to understand the foundations of the proposed scheme.
Abstract:The accuracy of the estimated stellar atmospheric parameter decreases evidently with the decreasing of spectral signal-to-noise ratio (SNR) and there are a huge amount of this kind observations, especially in case of SNR$<$30. Therefore, it is helpful to improve the parameter estimation performance for these spectra and this work studied the ($T_\texttt{eff}, \log~g$, [Fe/H]) estimation problem for LAMOST DR8 low-resolution spectra with 20$\leq$SNR$<$30. We proposed a data-driven method based on machine learning techniques. Firstly, this scheme detected stellar atmospheric parameter-sensitive features from spectra by the Least Absolute Shrinkage and Selection Operator (LASSO), rejected ineffective data components and irrelevant data. Secondly, a Multi-layer Perceptron (MLP) method was used to estimate stellar atmospheric parameters from the LASSO features. Finally, the performance of the LASSO-MLP was evaluated by computing and analyzing the consistency between its estimation and the reference from the APOGEE (Apache Point Observatory Galactic Evolution Experiment) high-resolution spectra. Experiments show that the Mean Absolute Errors (MAE) of $T_\texttt{eff}, \log~g$, [Fe/H] are reduced from the LASP (137.6 K, 0.195 dex, 0.091 dex) to LASSO-MLP (84.32 K, 0.137 dex, 0.063 dex), which indicate evident improvements on stellar atmospheric parameter estimation. In addition, this work estimated the stellar atmospheric parameters for 1,162,760 low-resolution spectra with 20$\leq$SNR$<$30 from LAMOST DR8 using LASSO-MLP, and released the estimation catalog, learned model, experimental code, trained model, training data and test data for scientific exploration and algorithm study.
Abstract:It is an active topic to investigate the schemes based on machine learning (ML) methods for detecting pulsars as the data volume growing exponentially in modern surveys. To improve the detection performance, input features into an ML model should be investigated specifically. In the existing pulsar detection researches based on ML methods, there are mainly two kinds of feature designs: the empirical features and statistical features. Due to the combinational effects from multiple features, however, there exist some redundancies and even irrelevant components in the available features, which can reduce the accuracy of a pulsar detection model. Therefore, it is essential to select a subset of relevant features from a set of available candidate features and known as {\itshape feature selection.} In this work, two feature selection algorithms ----\textit{Grid Search} (GS) and \textit{Recursive Feature Elimination} (RFE)---- are proposed to improve the detection performance by removing the redundant and irrelevant features. The algorithms were evaluated on the Southern High Time Resolution University survey (HTRU-S) with five pulsar detection models. The experimental results verify the effectiveness and efficiency of our proposed feature selection algorithms. By the GS, a model with only two features reach a recall rate as high as 99\% and a false positive rate (FPR) as low as 0.65\%; By the RFE, another model with only three features achieves a recall rate 99\% and an FPR of 0.16\% in pulsar candidates classification. Furthermore, this work investigated the number of features required as well as the misclassified pulsars by our models.
Abstract:We describe a scheme to extract linearly supporting (LSU) features from stellar spectra to automatically estimate the atmospheric parameters $T_{eff}$, log$~g$, and [Fe/H]. "Linearly supporting" means that the atmospheric parameters can be accurately estimated from the extracted features through a linear model. The successive steps of the process are as follow: first, decompose the spectrum using a wavelet packet (WP) and represent it by the derived decomposition coefficients; second, detect representative spectral features from the decomposition coefficients using the proposed method Least Absolute Shrinkage and Selection Operator (LARS)$_{bs}$; third, estimate the atmospheric parameters $T_{eff}$, log$~g$, and [Fe/H] from the detected features using a linear regression method. One prominent characteristic of this scheme is its ability to evaluate quantitatively the contribution of each detected feature to the atmospheric parameter estimate and also to trace back the physical significance of that feature. This work also shows that the usefulness of a component depends on both wavelength and frequency. The proposed scheme has been evaluated on both real spectra from the Sloan Digital Sky Survey (SDSS)/SEGUE and synthetic spectra calculated from Kurucz's NEWODF models. On real spectra, we extracted 23 features to estimate $T_{eff}$, 62 features for log$~g$, and 68 features for [Fe/H]. Test consistencies between our estimates and those provided by the Spectroscopic Sarameter Pipeline of SDSS show that the mean absolute errors (MAEs) are 0.0062 dex for log$~T_{eff}$ (83 K for $T_{eff}$), 0.2345 dex for log$~g$, and 0.1564 dex for [Fe/H]. For the synthetic spectra, the MAE test accuracies are 0.0022 dex for log$~T_{eff}$ (32 K for $T_{eff}$), 0.0337 dex for log$~g$, and 0.0268 dex for [Fe/H].