Abstract:Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
Abstract:Background: Knowledge graphs (KGs), especially medical knowledge graphs, are often significantly incomplete, which creates a demand for medical knowledge graph completion (MedKGC). MedKGC can find new facts based on the existing knowledge in the KGs. The path-based knowledge reasoning algorithm is one of the most important approaches to this task. This type of method has received great attention in recent years because of its high performance and interpretability. Traditional methods such as the path ranking algorithm (PRA) take the paths between an entity pair as atomic features. However, medical KGs are very sparse, which makes it difficult to learn effective semantic representations for extremely sparse path features. The sparsity in medical KGs is mainly reflected in the long-tailed distribution of entities and paths. Previous methods consider only the context structure in the paths of the knowledge graph and ignore the textual semantics of the symbols in the path. Their performance is therefore limited by both entity sparseness and path sparseness. To address these issues, this paper proposes two novel path-based reasoning methods, one for entity sparsity and one for path sparsity, which use the textual semantic information of entities and paths for MedKGC. Using the pre-trained model BERT to combine the textual semantic representations of entities and relationships, we model symbolic reasoning in the medical KG as a numerical computation problem over textual semantic representations.
Abstract:The joint entity and relation extraction task aims to extract all relational triples from a sentence. In essence, the relational triples contained in a sentence are unordered. However, previous seq2seq-based models require converting the set of triples into a sequence in the training phase. To break this bottleneck, we treat joint entity and relation extraction as a direct set prediction problem, so that the extraction model is freed from the burden of predicting the order of multiple triples. To solve this set prediction problem, we propose networks built on transformers with non-autoregressive parallel decoding. Unlike autoregressive approaches that generate triples one by one in a certain order, the proposed networks directly output the final set of triples in one shot. Furthermore, we design a set-based loss that forces unique predictions via bipartite matching. Compared with cross-entropy loss, which heavily penalizes small shifts in triple order, the proposed bipartite matching loss is invariant to any permutation of predictions; thus, it can provide the proposed networks with a more accurate training signal by ignoring triple order and focusing on relation types and entities. Experiments on two benchmark datasets show that our proposed model significantly outperforms current state-of-the-art methods. Training code and trained models will be available at http://github.com/DianboWork/SPN4RE.
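The permutation invariance claimed for the bipartite matching loss can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes equally sized prediction and gold sets, uses a hypothetical 0/1 slot-mismatch cost (`triple_cost`) in place of a likelihood-based cost, and finds the best assignment by brute force over permutations rather than with the Hungarian algorithm.

```python
from itertools import permutations


def triple_cost(pred, gold):
    """Hypothetical pair cost: number of mismatched slots in (subject, relation, object)."""
    return sum(p != g for p, g in zip(pred, gold))


def ordered_loss(preds, golds):
    """Order-sensitive baseline: compare the i-th prediction with the i-th gold triple."""
    return sum(triple_cost(p, g) for p, g in zip(preds, golds))


def set_matching_loss(preds, golds):
    """Minimum total cost over all one-to-one assignments of predictions to gold triples.

    Brute force over permutations, which is only practical for a handful of triples.
    """
    if len(preds) != len(golds):
        raise ValueError("this sketch assumes equally sized sets")
    return min(
        sum(triple_cost(preds[i], gold) for i, gold in zip(perm, golds))
        for perm in permutations(range(len(preds)))
    )


golds = [("Paris", "capital_of", "France"), ("Berlin", "capital_of", "Germany")]
preds = list(reversed(golds))  # correct triples, emitted in a different order

print(ordered_loss(preds, golds))       # penalizes the order shift
print(set_matching_loss(preds, golds))  # unaffected by the order shift
```

In this toy case the order-sensitive baseline charges for every shifted slot, while the matched loss is zero because each prediction is paired with its best gold triple before the cost is summed.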
Abstract:The ordered weighted $\ell_1$ norm (OWL) was recently proposed, with two different motivations: its good statistical properties as a sparsity promoting regularizer, and the fact that it generalizes the so-called {\it octagonal shrinkage and clustering algorithm for regression} (OSCAR), which has the ability to cluster/group regression variables that are highly correlated. This paper contains several contributions to the study and application of OWL regularization: the derivation of the atomic formulation of the OWL norm; the derivation of the dual of the OWL norm, based on its atomic formulation; a new and simpler derivation of the proximity operator of the OWL norm; an efficient scheme to compute the Euclidean projection onto an OWL ball; the instantiation of the conditional gradient (CG, also known as Frank-Wolfe) algorithm for linear regression problems under OWL regularization; the instantiation of accelerated projected gradient algorithms for the same class of problems. Finally, a set of experiments gives evidence that accelerated projected gradient algorithms are considerably faster than CG, for the class of problems considered.
Abstract:We consider a new family of regularizers, termed {\it weighted sorted $\ell_1$ norms} (WSL1), which generalizes the recently introduced {\it octagonal shrinkage and clustering algorithm for regression} (OSCAR) and also contains the $\ell_1$ and $\ell_{\infty}$ norms as particular instances. We focus on a special case of the WSL1, the {\sl decreasing WSL1} (DWSL1), where the elements of the argument vector are sorted in non-increasing order and the weights are also non-increasing. In this paper, after showing that the DWSL1 is indeed a norm, we derive two key tools for its use as a regularizer: the dual norm and the Moreau proximity operator.
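The definition of the decreasing case can be made concrete with a short sketch: sort the magnitudes in non-increasing order and take the inner product with non-increasing, non-negative weights. The function names are illustrative only, and the OSCAR weights assume the usual form of that penalty, $\lambda_1 \|x\|_1 + \lambda_2 \sum_{i<j} \max(|x_i|, |x_j|)$, which is not spelled out in the abstract.

```python
def wsl1_norm(x, w):
    """Sum of w[i] times the i-th largest magnitude of x (decreasing-weights case)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(a < b for a, b in zip(w, w[1:])) or (w and w[-1] < 0):
        raise ValueError("weights must be non-negative and non-increasing")
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(wi * m for wi, m in zip(w, magnitudes))


def oscar_weights(n, lam1, lam2):
    """Weights that reproduce OSCAR: the i-th largest entry (1-based) gets lam1 + lam2 * (n - i)."""
    return [lam1 + lam2 * (n - 1 - i) for i in range(n)]


def oscar_penalty(x, lam1, lam2):
    """Direct evaluation of the assumed OSCAR penalty, used only as a cross-check."""
    n = len(x)
    pairwise = sum(max(abs(x[i]), abs(x[j])) for i in range(n) for j in range(i + 1, n))
    return lam1 * sum(abs(v) for v in x) + lam2 * pairwise


x = [3.0, -1.0, 2.0]
print(wsl1_norm(x, [1.0, 1.0, 1.0]))             # equal weights give the l1 norm
print(wsl1_norm(x, [1.0, 0.0, 0.0]))             # a single leading weight gives the l_inf norm
print(wsl1_norm(x, oscar_weights(3, 1.0, 0.5)))  # linearly decaying weights give OSCAR
print(oscar_penalty(x, 1.0, 0.5))                # same value, computed pairwise
```

The three weight choices show how one sorted inner product covers the $\ell_1$ norm, the $\ell_{\infty}$ norm, and OSCAR as special instances.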
Abstract:We propose a new method, {\it robust binary fused compressive sensing} (RoBFCS), to recover sparse piece-wise smooth signals from 1-bit compressive measurements. The proposed method is a modification of our previous {\it binary fused compressive sensing} (BFCS) algorithm, which is based on the {\it binary iterative hard thresholding} (BIHT) algorithm. As in BIHT, the data term of the objective function is a one-sided $\ell_1$ (or $\ell_2$) norm. Experiments show that the proposed algorithm is able to take advantage of the piece-wise smoothness of the original signal and detect sign flips and correct them, achieving more accurate recovery than BFCS and BIHT.
Abstract:We propose a new approach, {\it two-dimensional fused binary compressive sensing} (2DFBCS), to recover 2D sparse piece-wise smooth signals from 1-bit measurements, exploiting 2D group sparsity for 1-bit compressive sensing recovery. The proposed method is a modified 2D version of the previous {\it binary iterative hard thresholding} (2DBIHT) algorithm, where the objective function includes a 2D one-sided $\ell_1$ (or $\ell_2$) penalty function encouraging agreement with the observed data, an indicator function of $K$-sparsity, and a total variation (TV) or modified TV (MTV) constraint. The subgradient of the 2D one-sided $\ell_1$ (or $\ell_2$) penalty and the projection onto the $K$-sparsity and TV or MTV constraint can be computed efficiently, allowing the application of algorithms of the {\it forward-backward splitting} (a.k.a. {\it iterative shrinkage-thresholding}) family. Experiments on the recovery of 2D sparse piece-wise smooth signals show that the proposed approach is able to take advantage of the piece-wise smoothness of the original signal, achieving more accurate recovery than 2DBIHT. More specifically, 2DFBCS with the MTV and the $\ell_2$ penalty performs best amongst the algorithms tested.
Abstract:We apply OSCAR (octagonal shrinkage and clustering algorithm for regression) to the recovery of group-sparse matrices (two-dimensional---2D---arrays) from compressive measurements. We propose a 2D version of OSCAR (2OSCAR) consisting of the $\ell_1$ norm and the pair-wise $\ell_{\infty}$ norm, which is convex but non-differentiable. We show that the proximity operator of 2OSCAR can be computed based on that of OSCAR. The 2OSCAR problem can thus be efficiently solved by state-of-the-art proximal splitting algorithms. Experiments on group-sparse 2D array recovery show that 2OSCAR regularization solved by the SpaRSA algorithm is the fastest choice, while the PADMM algorithm (with debiasing) yields the most accurate results.
Abstract:We propose a new method, {\it binary fused compressive sensing} (BFCS), to recover sparse piece-wise smooth signals from 1-bit compressive measurements. The proposed algorithm is a modification of the previous {\it binary iterative hard thresholding} (BIHT) algorithm, where, in addition to the sparsity constraint, the total variation of the recovered signal is bounded from above. As in BIHT, the data term of the objective function is a one-sided $\ell_1$ (or $\ell_2$) norm. Experiments on the recovery of sparse piece-wise smooth signals show that the proposed algorithm is able to take advantage of the piece-wise smoothness of the original signal, achieving more accurate recovery than BIHT.
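For orientation, the BIHT baseline that BFCS modifies can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the BFCS algorithm: it shows only the one-sided $\ell_1$ data term, a subgradient step, and the projection onto $K$-sparse vectors, and it omits both the total-variation constraint and any final normalization. All function names and the step size `tau` are illustrative.

```python
def sign(v):
    """Sign with the convention sign(0) = +1, as is common in 1-bit sensing."""
    return 1.0 if v >= 0 else -1.0


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def one_sided_l1(A, x, y):
    """Data term: total magnitude of measurements whose sign disagrees with y."""
    return sum(max(0.0, -yi * zi) for yi, zi in zip(y, matvec(A, x)))


def hard_threshold(x, k):
    """Projection onto k-sparse vectors: keep the k largest magnitudes."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def biht_step(A, x, y, k, tau):
    """One subgradient step on the one-sided l1 term, then hard thresholding."""
    residual = [yi - sign(zi) for yi, zi in zip(y, matvec(A, x))]
    # Back-project the sign residual through the transpose of A.
    grad = [sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(len(x))]
    moved = [xj + 0.5 * tau * gj for xj, gj in zip(x, grad)]
    return hard_threshold(moved, k)


A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1.0, -1.0, 1.0]  # observed signs
x = biht_step(A, [0.0, 0.0, 0.0], y, k=2, tau=1.0)
print(x, one_sided_l1(A, x, y))
```

A fused variant along the lines described in the abstract would replace `hard_threshold` with a projection that also enforces the total-variation bound; that projection is not sketched here.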
Abstract:We propose a novel SPARsity and Clustering (SPARC) regularizer, a modified version of the previous octagonal shrinkage and clustering algorithm for regression (OSCAR). The proposed regularizer consists of a $K$-sparse constraint and a pair-wise $\ell_{\infty}$ norm restricted to the $K$ largest components in magnitude. It is able to separably enforce $K$-sparsity and encourage the non-zeros to be equal in magnitude. Moreover, it can accurately group the features without shrinking their magnitude. SPARC is closely related to OSCAR, so the proximity operator of the former can be efficiently computed based on that of the latter, which allows proximal splitting algorithms to be used to solve problems with SPARC regularization. Experiments on synthetic data and with benchmark breast cancer data show that SPARC is a competitive group-sparsity inducing regularizer for regression and classification.