AI-Med Group, AI Innovation Center, Sharif University of Technology, Tehran, Iran; DML Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Abstract: Grapheme-to-phoneme (G2P) conversion is critical in speech processing, particularly for applications like speech synthesis. G2P systems must possess linguistic understanding and contextual awareness of languages with polyphone words and context-dependent phonemes. Large language models (LLMs) have recently demonstrated significant potential in various language tasks, suggesting that their phonetic knowledge could be leveraged for G2P. In this paper, we evaluate the performance of LLMs in G2P conversion and introduce prompting and post-processing methods that enhance LLM outputs without additional training or labeled data. We also present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language. Our results show that by applying the proposed methods, LLMs can outperform traditional G2P tools, even in an underrepresented language like Persian, highlighting the potential of developing LLM-aided G2P systems.
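A minimal sketch of the kind of LLM-aided G2P pipeline the abstract describes: a context-aware prompt plus lightweight post-processing that constrains the output to a phoneme inventory. The prompt template, phoneme set, cleanup rules, and the `call_llm` placeholder below are illustrative assumptions, not the paper's actual prompts or rules.

```python
# Illustrative LLM-aided G2P sketch: prompting + post-processing, no training.
ALLOWED_SYMBOLS = set("abcdefghijklmnopqrstuvwxyzʔʒæɒː-. ")  # assumed phoneme inventory

PROMPT_TEMPLATE = (
    "You are a Persian G2P system. Transcribe the sentence into phonemes, "
    "resolving polyphone words from context. Output phonemes only.\n"
    "Sentence: {sentence}\nPhonemes:"
)

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API; returns a canned answer here.
    return "in jek dʒomle-je nemune ʔæst ."

def postprocess(raw: str) -> str:
    # Keep only symbols from the inventory and normalize whitespace, dropping
    # any extra commentary the model may have produced.
    kept = "".join(ch for ch in raw.lower() if ch in ALLOWED_SYMBOLS)
    return " ".join(kept.split())

def g2p(sentence: str) -> str:
    return postprocess(call_llm(PROMPT_TEMPLATE.format(sentence=sentence)))

print(g2p("این یک جمله نمونه است."))
```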
Abstract: In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio at a sampling rate of 44.1 kHz. Alongside ManaTTS, we also created the VirgoolInformal dataset, comprising over 5 hours of audio, to evaluate the Persian speech recognition models used for forced alignment. Both datasets are supported by a fully transparent, MIT-licensed pipeline that includes tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method designed specifically for low-resource languages. With this dataset, we trained a Tacotron2-based TTS model that achieved a Mean Opinion Score (MOS) of 3.76, close to the MOS of 3.86 for utterances generated from natural spectrograms by the same vocoder and the MOS of 4.01 for natural waveforms, demonstrating the quality and effectiveness of the corpus.
Abstract: Decentralized optimization strategies are helpful for various applications, from networked estimation to distributed machine learning. This paper studies finite-sum minimization problems defined over a network of nodes and proposes a computationally efficient algorithm that solves distributed convex problems and finds the optimal solution even for locally non-convex objective functions. In contrast to the batch gradient optimization in some of the literature, our algorithm operates on a single time scale with no extra inner consensus loop, evaluating one gradient entry per node per time step. Further, the algorithm handles link-level nonlinearities representing, for example, logarithmic quantization or clipping of the exchanged data. Leveraging perturbation-based theory and algebraic analysis of the network Laplacian, we prove optimal convergence and stability of the dynamics over time-varying and switching networks, where the time variation may stem from packet drops or link failures. Despite the nonlinear nature of the dynamics, we prove exact convergence in the face of odd, sign-preserving, sector-bound nonlinear data transmission over the links. Illustrative numerical simulations further highlight our contributions.
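The link-level nonlinearity mentioned in the abstract can be made concrete with an odd, sign-preserving logarithmic quantizer and a toy single-time-scale consensus-plus-gradient update over a ring network; the quadratic costs, step sizes, and network used below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def log_quantize(x, rho=0.5):
    """Odd, sign-preserving logarithmic quantizer: |x| is rounded to the nearest
    power of rho, so log_quantize(-x) == -log_quantize(x) and the mapping stays
    inside a sector bound around the identity."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    e = np.round(np.log(np.abs(x[nz])) / np.log(rho))
    out[nz] = np.sign(x[nz]) * rho ** e
    return out

# Toy single-time-scale run: n nodes on a ring minimize sum_i (x - c_i)^2 while
# exchanging only log-quantized states over the links.
n, alpha, beta = 5, 0.3, 0.05
c = np.linspace(-2.0, 2.0, n)                   # local minimizers; network optimum is mean(c)
x = np.random.randn(n)
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
for _ in range(3000):
    q = log_quantize(x)
    x = np.array([x[i]
                  + alpha * sum(q[j] - q[i] for j in neighbors[i])  # consensus on quantized data
                  - beta * 2.0 * (x[i] - c[i])                      # local gradient step
                  for i in range(n)])
print("states:", np.round(x, 3), "network optimum:", c.mean())
```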
Abstract: Decentralized strategies are of interest for learning from large-scale data over networks. This paper studies learning over a network of geographically distributed nodes/agents subject to quantization. Each node possesses a private local cost function, and together these contribute to a global cost function that the proposed methodology aims to minimize. In contrast to much of the existing literature, the information exchange among nodes is quantized. We adopt a first-order, computationally efficient distributed optimization algorithm (with no extra inner consensus loop) that leverages node-level gradient correction based on local data and network-level gradient aggregation only over nearby nodes. This method only requires balanced networks, with no need for stochastic weight design, and can handle log-scale quantized data exchange over possibly time-varying and switching network setups. We analyze convergence over both structured networks (for example, training over data centers) and ad-hoc multi-agent networks (for example, training over dynamic robotic networks). Through analysis and experimental validation, we show that (i) structured networks generally result in a smaller optimality gap, and (ii) logarithmic quantization leads to a smaller optimality gap than uniform quantization.
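To make finding (ii) concrete, the sketch below compares the relative error of a uniform quantizer and a logarithmic quantizer across several magnitudes; because the exchanged gradients and disagreement terms shrink as the algorithm converges, a magnitude-independent relative error (the logarithmic case) is the intuition behind the smaller optimality gap. The quantizer parameters are illustrative, not the paper's settings.

```python
import numpy as np

def uniform_quantize(x, delta=0.1):
    # Fixed-step uniform quantizer: absolute error is bounded by delta / 2.
    return delta * np.round(np.asarray(x, dtype=float) / delta)

def log_quantize(x, rho=0.9):
    # Logarithmic quantizer: relative error is bounded, independent of magnitude.
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    e = np.round(np.log(np.abs(x[nz])) / np.log(rho))
    out[nz] = np.sign(x[nz]) * rho ** e
    return out

vals = np.logspace(-3, 1, 9)   # magnitudes from 0.001 to 10, e.g. shrinking gradients
for name, q in [("uniform", uniform_quantize), ("log", log_quantize)]:
    rel_err = np.abs(q(vals) - vals) / vals
    print(f"{name:8s} relative error:", np.round(rel_err, 3))
```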
Abstract: Meta-learning involves multiple learners, each dedicated to specific tasks, collaborating in a data-constrained setting. In current meta-learning methods, task learners locally learn models from sensitive data, termed support sets. These task learners then share model-related information, such as gradients or loss values computed on another part of the data, termed the query set, with a meta-learner. The meta-learner employs this information to update its meta-knowledge. Despite the absence of explicit data sharing, privacy concerns persist. This paper examines potential data leakage in a prominent meta-learning algorithm, specifically Model-Agnostic Meta-Learning (MAML), in which gradients are shared between the meta-learner and task learners. The primary objective is to scrutinize the shared gradient and the information it encompasses about the task dataset. We then propose membership inference attacks targeting the task dataset, comprising the support and query sets. Finally, we explore various noise injection methods designed to safeguard the privacy of task data and thwart these attacks. Experimental results demonstrate the effectiveness of the attacks on MAML and the efficacy of proper noise injection in countering them.
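A hedged sketch of the kind of noise injection the abstract explores: clipping the task-level gradient and adding Gaussian noise before it is shared with the meta-learner. The clipping bound and noise scale below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def noisy_gradients(grads, clip_norm=1.0, sigma=0.1, rng=None):
    """Clip the task-level gradient to a norm bound and add Gaussian noise before
    it is shared with the meta-learner (DP-style sketch; clip_norm and sigma are
    illustrative, not the paper's settings)."""
    rng = rng or np.random.default_rng()
    flat = np.concatenate([g.ravel() for g in grads])
    scale = min(1.0, clip_norm / (np.linalg.norm(flat) + 1e-12))
    return [g * scale + rng.normal(0.0, sigma * clip_norm, size=g.shape) for g in grads]

# Example: perturb a toy two-layer gradient before sharing it.
g = [np.random.randn(4, 3), np.random.randn(3)]
print([p.shape for p in noisy_gradients(g)])
```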
Abstract: The goal of protein-ligand binding affinity (PLA) prediction is to predict whether a ligand can bind to a protein sequence. Deep learning has recently received much attention in PLA prediction. Deep learning-based approaches involve two steps: a feature extraction step and a task prediction step. Many such approaches concentrate on introducing new feature extraction networks or integrating auxiliary knowledge, such as protein-protein interaction networks or gene ontology knowledge, while the task prediction network is designed simply as a few fully connected layers. This paper aims to integrate retrieved similar hard protein-ligand pairs into PLA prediction (i.e., the task prediction step) using a semi-supervised graph convolutional network (GCN). Hard protein-ligand pairs are retrieved for each input query sample based on a manifold smoothness constraint. A graph is then learned automatically in which each node is a protein-ligand pair and each edge represents the similarity between pairs. In other words, we propose an end-to-end framework that simultaneously retrieves hard similar samples, learns a protein-ligand descriptor, learns the graph topology of the input sample together with the retrieved hard samples (i.e., the adjacency matrix), and learns a semi-supervised GCN to predict the binding affinity (the task predictor). The training step adjusts the parameter values, and in the inference step, the learned model is fine-tuned for each input sample. We evaluate the proposed approach on four well-known datasets: PDBbind, Davis, KIBA, and BindingDB. The results show that the proposed method performs significantly better than comparable approaches.
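The following forward-pass sketch illustrates the main idea: the query pair and its retrieved hard pairs form the nodes of a small graph, a similarity-based adjacency matrix is built from their descriptors, and a GCN propagates over it to score the query. In the paper the adjacency and all weights are learned end to end; here the weights are random and the similarity kernel is a simple assumption.

```python
import numpy as np

def gcn_forward(features, temperature=1.0, hidden=16, seed=0):
    """One normalized GCN propagation over a graph whose adjacency is derived
    from pairwise descriptor similarity (forward-pass sketch with random
    weights; in the paper the adjacency and weights are learned end to end)."""
    rng = np.random.default_rng(seed)
    # Learned-adjacency surrogate: similarity kernel over pairwise distances.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    a = np.exp(-d2 / temperature)
    a /= a.sum(axis=1, keepdims=True)              # row-normalized adjacency
    w1 = rng.normal(size=(features.shape[1], hidden))
    w2 = rng.normal(size=(hidden, 1))
    h = np.maximum(a @ features @ w1, 0.0)         # propagate + ReLU
    return (a @ h @ w2).ravel()                    # per-node affinity scores

# Node 0 is the query protein-ligand pair; the rest are retrieved hard pairs.
pairs = np.random.randn(6, 8)                      # toy pair descriptors
print(gcn_forward(pairs)[0])                       # predicted affinity for the query
```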
Abstract: Motivation: Drug repurposing is a viable solution for reducing the time and cost associated with drug development. However, the drug repurposing approaches proposed thus far have yet to meet expectations. Therefore, it is crucial to offer a systematic approach to drug repurposing that achieves cost savings and improves human lives. In recent years, biological network-based methods for drug repurposing have generated promising results. Nevertheless, these methods have limitations. Primarily, they are limited in the size and variety of data they can effectively handle. Another issue arises from the treatment of heterogeneous data, which often must be converted into homogeneous data, leading to a loss of information. A significant drawback is that most of these approaches lack end-to-end functionality, necessitating manual implementation and expert knowledge at certain stages. Results: We propose a new solution, HGTDR (Heterogeneous Graph Transformer for Drug Repurposing), to address these challenges. HGTDR is a three-step approach to knowledge graph-based drug repurposing: 1) constructing a heterogeneous knowledge graph, 2) applying a heterogeneous graph transformer network, and 3) computing relationship scores using a fully connected network. By leveraging HGTDR, users can manipulate input graphs, extract information from diverse entities, and obtain their desired output. In the evaluation step, we demonstrate that HGTDR performs comparably to previous methods. Furthermore, we review medical studies to validate our method's top ten drug repurposing suggestions, which have exhibited promising results. We also demonstrate, through numerical and experimental validation, HGTDR's capability to predict other types of relations, such as drug-protein and disease-protein interactions.
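A minimal sketch of step 3 only (relationship scoring with a fully connected network), assuming the drug and disease embeddings were already produced by the heterogeneous graph transformer of step 2; the layer sizes and random weights are illustrative.

```python
import numpy as np

def relation_score(drug_emb, disease_emb, rng=None):
    """Step-3 sketch: score a candidate drug-disease pair by feeding the
    concatenated entity embeddings (from the heterogeneous graph transformer
    of step 2) through a small fully connected network with illustrative,
    untrained weights."""
    rng = rng or np.random.default_rng(0)
    x = np.concatenate([drug_emb, disease_emb])
    w1, b1 = rng.normal(size=(x.size, 32)), np.zeros(32)
    w2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
    h = np.maximum(x @ w1 + b1, 0.0)               # hidden layer with ReLU
    logit = float((h @ w2 + b2)[0])
    return 1.0 / (1.0 + np.exp(-logit))            # probability-like relation score

print(relation_score(np.random.randn(64), np.random.randn(64)))
```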
Abstract: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) is a gene editing technology that has revolutionized the fields of biology and medicine. However, one of the challenges of using CRISPR is predicting the on-target efficacy and off-target sensitivity of single-guide RNAs (sgRNAs). This is because most existing methods are trained on separate datasets with different genes and cells, which limits their generalizability. In this paper, we propose a novel ensemble learning method for sgRNA design that is accurate and generalizable. Our method combines the predictions of multiple machine learning models to produce a single, more robust prediction. This approach allows us to learn from a wider range of data, which improves the generalizability of our model. We evaluated our method on a benchmark dataset of sgRNA designs and found that it outperformed existing methods in terms of both accuracy and generalizability. Our results suggest that our method can be used to design sgRNAs with high sensitivity and specificity, even for new genes or cells. This could have important implications for the clinical use of CRISPR, as it would allow researchers to design more effective and safer treatments for a variety of diseases.
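A hedged sketch of the ensembling idea: several base predictors score the same encoded sgRNAs and their outputs are combined into a single, more robust prediction. The toy base models and the simple weighted average below are illustrative; the paper's base learners and combination rule may differ.

```python
import numpy as np

def ensemble_predict(models, x, weights=None):
    """Combine per-model on-target efficacy predictions for a batch of encoded
    sgRNAs into one score (simple weighted averaging as an illustration)."""
    preds = np.stack([m(x) for m in models])       # shape: (n_models, n_sgRNAs)
    w = np.ones(len(models)) / len(models) if weights is None else np.asarray(weights)
    return w @ preds

# Toy base "models" operating on numeric sgRNA feature vectors.
models = [lambda x: x.mean(1), lambda x: x.max(1), lambda x: (x ** 2).mean(1)]
x = np.random.rand(5, 30)                          # 5 sgRNAs, 30 features each
print(ensemble_predict(models, x))
```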
Abstract: Recommendation systems often employ continuous training, leading to a self-feedback loop bias in which the system becomes biased toward its previous recommendations. Recent studies have attempted to mitigate this bias by collecting small amounts of unbiased data. While these studies have successfully developed less biased models, they ignore the crucial fact that the recommendations generated by the model serve as the training data for subsequent training sessions. To address this issue, we propose a framework that learns an unbiased estimator using a small amount of uniformly collected data and focuses on generating improved training data for subsequent training iterations. To accomplish this, we view recommendation as a contextual multi-armed bandit problem and emphasize exploring items of which the model has a limited understanding. We introduce a new offline sequential training schema that simulates real-world continuous training scenarios in recommendation systems, offering a more appropriate framework for studying self-feedback bias. We demonstrate the superiority of our model over state-of-the-art debiasing methods through extensive experiments using the proposed training schema.
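The exploration idea can be sketched as an upper-confidence-style selection rule that adds a bonus to items the model has rarely recommended before; the bonus form and parameters below are illustrative, not the paper's exact strategy.

```python
import numpy as np

def recommend(scores, exposure_counts, c=1.0, top_k=5):
    """Pick items by predicted relevance plus an exploration bonus that is large
    for items the system has rarely shown (UCB-style sketch; the paper's
    exploration rule may differ)."""
    total = exposure_counts.sum() + 1.0
    bonus = c * np.sqrt(np.log(total) / (exposure_counts + 1.0))
    return np.argsort(-(scores + bonus))[:top_k]

scores = np.random.rand(100)             # debiased relevance estimates
counts = np.random.poisson(20, 100).astype(float)  # past exposures per item
counts[:10] = 0.0                         # brand-new items receive a large bonus
print(recommend(scores, counts))
```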
Abstract: The fast growth of social networks and their data access limitations in recent years have made it increasingly difficult to obtain the complete topology of these networks. However, diffusion information over these networks is available, and many algorithms have been proposed to infer the underlying networks from this information. Previously proposed algorithms focus only on inferring more links and ignore preserving the critical topological characteristics of the underlying social networks. In this paper, we propose a novel method called DANI that infers the underlying network while preserving its structural properties. It is based on the Markov transition matrix derived from time-series cascades, as well as the node-node similarity observable in cascade behavior from a structural point of view. In addition, the presented method has linear time complexity (it increases linearly with the number of nodes, the number of cascades, and the square of the average cascade length), and its distributed version in the MapReduce framework is also scalable. We applied the proposed approach to both real and synthetic networks. The experimental results show that DANI achieves higher accuracy and lower run time than well-known network inference methods while maintaining structural properties, including modular structure, degree distribution, connected components, density, and clustering coefficient.
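A simplified sketch of the first ingredient, building a row-stochastic node-to-node matrix from time-ordered cascades; DANI's exact weighting and its structural node-node similarity term are omitted, so this only illustrates the counting idea.

```python
import numpy as np

def cascade_transition_matrix(cascades, n_nodes):
    """Build a row-stochastic node-to-node matrix from time-ordered cascades:
    each earlier-activated node in a cascade contributes weight toward every
    later-activated node (simplified counting rule; DANI's exact weighting and
    structural similarity term are not reproduced here)."""
    counts = np.zeros((n_nodes, n_nodes))
    for cascade in cascades:                     # cascade = nodes in activation order
        for i, u in enumerate(cascade):
            for v in cascade[i + 1:]:
                counts[u, v] += 1.0
    row = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)

cascades = [[0, 2, 3], [1, 2, 4], [0, 1, 3, 4]]   # toy activation orders
print(cascade_transition_matrix(cascades, 5).round(2))
```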