Abstract: There have been growing concerns around high-stakes applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold-standard dataset of 4,105 excerpts drawn from a large corpus and annotated for bias by medical experts. We build on previous work by coauthors, which augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We find that LLMs, considered SOTA on many NLP tasks, are unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
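As an illustration of the proposed filtering step, here is a minimal sketch that uses NLTK's classic Lesk algorithm as a stand-in for the WSD models evaluated in the paper; the set of person-referring WordNet senses below is a hypothetical placeholder.

```python
# Minimal sketch of WSD-based filtering: keep a sentence only when a
# social-identifier term is used in a demographic (person-referring) sense.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

# Hypothetical placeholder: WordNet senses of "white" that refer to people.
DEMOGRAPHIC_SENSES = {"white.n.01", "white.n.02"}

def is_demographic_use(sentence: str, term: str = "white") -> bool:
    """Disambiguate `term` in context; True if the sense refers to people."""
    sense = lesk(sentence.lower().split(), term)
    return sense is not None and sense.name() in DEMOGRAPHIC_SENSES

corpus = [
    "Lesions were found in the white matter of the spinal cord.",
    "White patients were more likely to receive a timely referral.",
]
# Sentences like the first one would be dropped as irrelevant negatives.
filtered = [s for s in corpus if is_demographic_use(s)]
```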
Abstract: As learning-to-rank models are increasingly deployed for decision-making in areas with profound life implications, the FairML community has been developing fair learning-to-rank (LTR) models. These models rely on the availability of sensitive demographic features such as race or sex. In practice, however, regulatory obstacles and privacy concerns protect this data from collection and use. As a result, practitioners may either need to promote fairness despite the absence of these features or turn to demographic inference tools to attempt to infer them. Given that these tools are fallible, this paper aims to further understand how errors in demographic inference impact the fairness performance of popular fair LTR strategies. In which cases is it better to keep demographic attributes hidden from models, and in which to infer them? We examine a spectrum of strategies, from fair LTR with demographic features hidden or inferred to fairness-unaware LTR followed by fair re-ranking. We conduct a controlled empirical investigation modeling different levels of inference error by systematically perturbing the inferred sensitive attribute. We also perform three case studies with real-world datasets and popular open-source inference methods. Our findings reveal that as inference noise grows, LTR-based methods that incorporate fairness considerations into the learning process may increase bias. In contrast, fair re-ranking strategies are more robust to inference errors. All source code, data, and artifacts of our experimental study are available here: https://github.com/sewen007/hoiltr.git
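To make the perturbation protocol concrete, the sketch below models one plausible noise process: each binary sensitive attribute is flipped independently with a fixed probability. The flip model and constants are illustrative assumptions, not necessarily the exact scheme used in the paper.

```python
# Sketch of controlled perturbation of inferred sensitive attributes:
# flip each binary group label independently with probability `noise`.
import numpy as np

rng = np.random.default_rng(0)

def perturb_attributes(true_attr: np.ndarray, noise: float) -> np.ndarray:
    """Return a noisy copy of a 0/1 sensitive-attribute vector."""
    flips = rng.random(true_attr.shape) < noise
    return np.where(flips, 1 - true_attr, true_attr)

attrs = rng.integers(0, 2, size=1000)        # ground-truth group labels (toy)
for noise in (0.0, 0.1, 0.3):                # sweep over inference-error levels
    noisy = perturb_attributes(attrs, noise)
    print(noise, (noisy != attrs).mean())    # empirical flip rate
```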
Abstract: Social media may disseminate medical claims that highlight misleading correlations between social identifiers and diseases because they fail to account for structural determinants of health. Our research aims to identify biased medical claims on Twitter and measure their spread. We propose a machine learning framework that uses two models in tandem: RoBERTa to detect medical claims and DistilBERT to classify bias. After identifying original biased medical claims, we conducted a retweet cascade analysis, computing the reach and rate of spread of each claim. Tweets containing biased claims were found to circulate faster and further than unbiased claims.
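A minimal sketch of the two-stage pipeline follows, using Hugging Face pipelines. The model names and label scheme are placeholders, not the fine-tuned checkpoints used in the study.

```python
# Sketch of the tandem pipeline: a RoBERTa-based model flags medical claims,
# then a DistilBERT-based model labels flagged claims as biased or not.
from transformers import pipeline

claim_detector = pipeline("text-classification", model="roberta-base")              # placeholder
bias_classifier = pipeline("text-classification", model="distilbert-base-uncased")  # placeholder

def analyze(tweet: str):
    """Return a bias label for tweets flagged as medical claims, else None."""
    claim = claim_detector(tweet)[0]
    if claim["label"] != "MEDICAL_CLAIM":   # hypothetical positive-class label
        return None
    return bias_classifier(tweet)[0]        # e.g., {"label": ..., "score": ...}
```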
Abstract: Money laundering is one of the most significant criminal activities today, given its potential to cause massive financial losses to governments, banks, and other institutions. We propose DELATOR, a new CAAT (computer-assisted audit technology) that detects money laundering activities using neural network models that encode bank transfers as a large-scale temporal graph. In collaboration with a Brazilian bank, we design and apply an evaluation strategy to quantify DELATOR's performance on historic data comprising millions of clients. DELATOR outperforms an off-the-shelf solution from Amazon AWS by 18.9% with respect to AUC. We also conducted real experiments that led to the discovery of 8 new suspicious cases among the 100 analyzed, which would have been reported to the authorities under the current criteria.
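As a rough illustration of the input representation (not DELATOR's neural architecture), the sketch below encodes bank transfers as timestamped, weighted directed edges and extracts a time-window snapshot; the data and window bounds are toy assumptions.

```python
# Sketch: bank transfers as a temporal graph, with each transfer a
# timestamped, amount-weighted directed edge between client nodes.
import networkx as nx

transfers = [  # (sender, receiver, amount, unix_time) -- toy data
    ("A", "B", 9500.0, 1_600_000_000),
    ("B", "C", 9400.0, 1_600_003_600),
    ("A", "C", 200.0, 1_600_007_200),
]

G = nx.MultiDiGraph()
for src, dst, amount, t in transfers:
    G.add_edge(src, dst, amount=amount, time=t)

# Snapshot view: edges within a time window, e.g., one step of a temporal GNN.
window = [(u, v, d) for u, v, d in G.edges(data=True)
          if 1_600_000_000 <= d["time"] < 1_600_004_000]
```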
Abstract: Deep clustering (DC) leverages the representation power of deep architectures to learn embedding spaces that are optimal for cluster analysis. This approach filters out low-level information irrelevant for clustering and has proven remarkably successful for high-dimensional data spaces. Some DC methods employ Generative Adversarial Networks (GANs), motivated by the powerful latent representations these models are able to learn implicitly. In this work, we propose HC-MGAN, a new technique based on GANs with multiple generators (MGANs), which have not been explored for clustering. Our method is inspired by the observation that each generator of an MGAN tends to generate data that correlates with a sub-region of the real data distribution. We use this clustered generation to train a classifier that infers which generator a given image came from, thus providing a semantically meaningful clustering of the real distribution. Additionally, we design our method so that it is performed in a top-down hierarchical clustering tree, thus proposing, to the best of our knowledge, the first hierarchical DC method. We conduct several experiments to evaluate the proposed method against recent DC methods, obtaining competitive results. Last, we perform an exploratory analysis of the hierarchical clustering tree that highlights how accurately it organizes the data into a hierarchy of semantically coherent patterns.
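The clustering readout can be sketched as follows: a classifier trained to tell which of the k generators produced a sample is applied to real images, and its argmax is read as a cluster assignment. The toy architecture below is a placeholder, not the HC-MGAN networks.

```python
# Sketch of the MGAN clustering readout: the generator-index classifier,
# applied to real images, yields cluster assignments at one tree node.
import torch
import torch.nn as nn

k = 2  # number of generators = number of clusters at this tree node

classifier = nn.Sequential(  # toy stand-in for the trained classifier
    nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, k)
)

def cluster_assign(real_images: torch.Tensor) -> torch.Tensor:
    """Map real images to the generator (cluster) they most resemble."""
    with torch.no_grad():
        logits = classifier(real_images)
    return logits.argmax(dim=1)

batch = torch.randn(16, 1, 28, 28)  # e.g., MNIST-shaped inputs
print(cluster_assign(batch))        # cluster index per image
```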
Abstract: Most Fairness in AI research focuses on exposing biases in AI systems. A broader lens on fairness reveals that AI can serve a greater aspiration: rooting out societal inequities from their source. Specifically, we focus on inequities in health information, and aim to reduce bias in that domain using AI. The AI algorithms under the hood of search engines and social media, many of which are based on recommender systems, have an outsized impact on the quality of medical and health information online. Therefore, embedding bias detection and reduction into the recommender systems serving up medical and health content could have a correspondingly large positive impact on patient outcomes and wellbeing. In this position paper, we offer the following contributions: (1) we propose a novel framework of Fairness via AI, inspired by insights from medical education, sociology, and antiracism; (2) we define a new term, bisinformation, which is related to, but distinct from, misinformation, and encourage researchers to study it; (3) we propose using AI to study, detect, and mitigate biased, harmful, and/or false health information that disproportionately hurts minority groups in society; and (4) we suggest several pillars and pose several open problems in order to seed inquiry in this new space. While part (3) of this work specifically focuses on the health domain, the fundamental computer science advances and contributions stemming from research efforts in bias reduction and Fairness via AI have broad implications in all areas of society.
Abstract: Scientific knowledge cannot be seen as a set of isolated fields, but as a highly connected network. Understanding how research areas are connected is of paramount importance for adequately allocating funding and human resources (e.g., assembling teams to tackle multidisciplinary problems). The relationship between disciplines can be drawn from data on the trajectories of individual scientists, as researchers often make contributions in a small set of interrelated areas. Two recent works propose methods for creating research maps from scientists' publication records: one uses a frequentist approach to create a transition probability matrix; the other learns embeddings (vector representations). Surprisingly, these models were evaluated on different datasets and have never been compared in the literature. In this work, we compare both models in a systematic way, using a large dataset of publication records from Brazilian researchers. We evaluate the models' ability to predict whether a given entity (scientist, institution, or region) will enter a new field, measured by the area under the ROC curve. Moreover, we analyze how sensitive each method is to the number of publications and the number of fields associated with an entity. Last, we conduct a case study to showcase how these models can be used to characterize science dynamics in the context of Brazil.
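For concreteness, the frequentist research map can be sketched as follows: count transitions between consecutive fields in each scientist's publication record and row-normalize the counts into a transition probability matrix. The records below are toy data.

```python
# Sketch of the frequentist research map: transition counts between
# consecutive fields per scientist, row-normalized into probabilities.
from collections import Counter, defaultdict

# Toy records: each scientist's publications as a time-ordered list of fields.
records = [
    ["physics", "physics", "cs", "cs"],
    ["cs", "math", "cs"],
]

counts = defaultdict(Counter)
for fields in records:
    for a, b in zip(fields, fields[1:]):
        counts[a][b] += 1

# Row-normalize: P[a][b] = Pr(next field is b | current field is a).
P = {a: {b: c / sum(nbrs.values()) for b, c in nbrs.items()}
     for a, nbrs in counts.items()}
print(P["physics"])  # {'physics': 0.5, 'cs': 0.5}
```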
Abstract: Inertial Measurement Unit (IMU) sensors are becoming increasingly ubiquitous in everyday devices such as smartphones and fitness watches. As a result, the array of health-related applications that tap into this data has been growing, as has the importance of designing accurate prediction models for tasks such as human activity recognition (HAR). However, one important task that has received little attention is predicting an individual's heart rate during a physical activity from IMU data. This could be used, for example, to determine which activities are safe for a person without requiring them to actually perform the activities. We propose a neural architecture for this task composed of convolutional and LSTM layers, similar to state-of-the-art techniques for the closely related task of HAR. However, our model includes a convolutional network that extracts, from sensor data of a previously executed activity, a physical conditioning embedding (PCE) of the individual, which is used as the LSTM's initial hidden state. We evaluate the proposed model, dubbed PCE-LSTM, when predicting the heart rate of 23 subjects performing a variety of physical activities from IMU-sensor data available in public datasets (PAMAP2, PPG-DaLiA). For comparison, we use as baselines the only model specifically proposed for this task and an adapted state-of-the-art model for HAR. PCE-LSTM yields a mean absolute error over 10% lower than these baselines. We demonstrate empirically that this error reduction is in part due to the use of the PCE. Last, we use two datasets (PPG-DaLiA, WESAD) to show that PCE-LSTM can also be successfully applied, when photoplethysmography (PPG) sensors are available, to rectify heart rate measurement errors caused by movement, outperforming the state-of-the-art deep learning baselines by more than 30%.
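A minimal PyTorch sketch of the PCE-LSTM idea follows: a 1D CNN summarizes IMU data from a previously executed activity into the PCE, which initializes the hidden state of an LSTM that predicts heart rate over the new activity. Layer sizes and channel counts are illustrative, not the paper's.

```python
# Sketch of PCE-LSTM: a CNN-derived physical conditioning embedding (PCE)
# initializes the hidden state of an LSTM heart-rate predictor.
import torch
import torch.nn as nn

class PCELSTM(nn.Module):
    def __init__(self, n_channels: int = 6, hidden: int = 64):
        super().__init__()
        self.pce_net = nn.Sequential(            # extracts the PCE
            nn.Conv1d(n_channels, 32, kernel_size=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, hidden),
        )
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)         # heart rate per time step

    def forward(self, past_imu, current_imu):
        # past_imu: (B, C, T0) from a prior activity; current_imu: (B, T, C)
        h0 = self.pce_net(past_imu).unsqueeze(0)  # PCE as initial hidden state
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(current_imu, (h0, c0))
        return self.head(out).squeeze(-1)         # (B, T) heart-rate series

model = PCELSTM()
hr = model(torch.randn(4, 6, 200), torch.randn(4, 150, 6))
```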
Abstract: Online Social Networks have become an important medium for people who suffer from mental disorders to share moments of hardship and seek support. Here we analyze how Reddit discussions can help improve the health conditions of their users. Using the emotional tone of a user's posts as a proxy for their emotional state, we uncover relationships between changes in that state and the interactions the user has in a given community. First, we observe that authors of negative posts often write more positive comments after engaging in discussions. Second, we build models based on state-of-the-art embedding techniques and RNNs to predict shifts in emotional tone. We show that it is possible to predict with good accuracy how users of mental disorder online communities will react to the interactions they experience on these platforms. Our models could assist in interventions promoted by health care professionals to provide support to people suffering from mental illness.
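One plausible instantiation of the prediction setup is sketched below: the sequence of embedded comments a post received is fed to a recurrent network that predicts whether the author's tone will shift to be more positive. The GRU architecture and dimensions are assumptions for illustration, not the paper's models.

```python
# Toy sketch of tone-shift prediction from a sequence of comment embeddings.
import torch
import torch.nn as nn

class ToneShiftRNN(nn.Module):
    def __init__(self, emb_dim: int = 300, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, comment_embs):            # (B, n_comments, emb_dim)
        _, h = self.gru(comment_embs)
        return torch.sigmoid(self.out(h[-1]))   # P(tone becomes more positive)

model = ToneShiftRNN()
p = model(torch.randn(8, 12, 300))  # 8 posts, 12 embedded comments each
```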
Abstract: Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights, under a query budget. However, in most networks, nodes, topology, and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the currently queried node set, and the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and its node attributes) and its neighbors (without any node or edge attributes). Selective harvesting is therefore a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario suffers from a tunnel vision effect: without recourse to independent sampling, the urge to query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of AS methods and standard classifiers. We find that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as an ensemble, but by switching between the classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried next. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration-exploitation duality found in classic sequential decision problems. From these observations, we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, matching or exceeding the performance of competing methods on seven real network datasets in our evaluation.
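The classifier-switching idea can be sketched with a discounted Thompson sampling bandit over classifiers, where the reward is whether the queried node turned out to be a target. The discount rule and constants below are illustrative assumptions, not the exact D3TS algorithm.

```python
# Sketch: a discounted Thompson-sampling bandit picks which classifier
# selects the next node to query; rewards are discounted to track drift.
import numpy as np

class DiscountedTS:
    """Thompson sampling over K classifiers with discounting for drift."""
    def __init__(self, k: int, gamma: float = 0.95, seed: int = 0):
        self.alpha = np.ones(k)          # Beta posterior successes per arm
        self.beta = np.ones(k)           # Beta posterior failures per arm
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def choose(self) -> int:
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))

    def update(self, arm: int, reward: int) -> None:
        self.alpha *= self.gamma         # forget old evidence (non-stationarity)
        self.beta *= self.gamma
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

bandit = DiscountedTS(k=3)               # three diverse classifiers
rng = np.random.default_rng(1)
for step in range(100):                  # stand-in for the harvesting loop
    clf = bandit.choose()                # which classifier picks the next node
    reward = int(rng.random() < 0.3)     # placeholder: was the node a target?
    bandit.update(clf, reward)
```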