Publications

Explore the publications powered by PromptBio, featuring peer-reviewed articles, collaborative research studies, and thought leadership pieces. Our publications highlight scientific breakthroughs, practical applications, and innovative methodologies that demonstrate the impact of data-driven insights across the life sciences.


PromptBio: A Multi-Agent AI Platform for Bioinformatics Data Analysis

PromptBio is a modular AI platform for scalable, reproducible, and user-adaptable bioinformatics analysis, powered by generative AI and natural language interaction. It supports three complementary modes of analysis designed to meet diverse research needs. PromptGenie is a multi-agent system that enables stepwise, human-in-the-loop workflows using prevalidated domain-standard tools. Within PromptGenie, specialized agents—including DataAgent, OmicsAgent, AnalysisAgent, and QAgent—collaborate to manage tasks such as data ingestion, pipeline execution, statistical analysis, and interactive summarization. DiscoverFlow provides integrated, automated workflows for large-scale multi-omics analysis, offering end-to-end execution and streamlined orchestration. ToolsGenie complements these modes by dynamically generating executable bioinformatics code for custom, user-defined analyses, enabling flexibility beyond standardized workflows. PromptGenie and DiscoverFlow leverage a suite of domain-specific tools, including Omics Tools for standardized omics pipelines, Analysis Tools for downstream statistical interpretation, and MLGenie for machine learning and multi-omics modeling. We present the design, capabilities, and validation of these components, highlight their integration into automated and customizable workflows, and discuss extensibility, monitoring, and compliance. PromptBio aims to democratize high-throughput bioinformatics through large language model–powered natural language understanding, workflow generation, and agent orchestration.
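To make the stepwise, human-in-the-loop pattern concrete, the following minimal Python sketch shows how specialized agents might be chained with an approval checkpoint after each step. The agent names mirror the abstract, but the interfaces, shared context object, and approval hook are illustrative assumptions, not PromptBio's actual API.

    # Minimal sketch of a multi-agent, human-in-the-loop workflow.
    # Agent names follow the abstract; all interfaces below are hypothetical.
    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    @dataclass
    class Context:
        """Shared state passed between agents as the workflow advances."""
        artifacts: Dict[str, str] = field(default_factory=dict)
        log: List[str] = field(default_factory=list)

    class Agent:
        name = "agent"
        def run(self, ctx: Context) -> Context:
            raise NotImplementedError

    class DataAgent(Agent):
        name = "DataAgent"
        def run(self, ctx: Context) -> Context:
            ctx.artifacts["raw_data"] = "ingested sample manifest"  # placeholder ingestion
            ctx.log.append(f"{self.name}: data ingested and validated")
            return ctx

    class OmicsAgent(Agent):
        name = "OmicsAgent"
        def run(self, ctx: Context) -> Context:
            ctx.artifacts["counts"] = "gene-level count matrix"  # placeholder pipeline output
            ctx.log.append(f"{self.name}: standardized omics pipeline executed")
            return ctx

    class AnalysisAgent(Agent):
        name = "AnalysisAgent"
        def run(self, ctx: Context) -> Context:
            ctx.artifacts["results"] = "differential expression table"
            ctx.log.append(f"{self.name}: statistical analysis complete")
            return ctx

    def run_workflow(agents: List[Agent], approve: Callable[[str], bool]) -> Context:
        """Run agents in order, pausing for human approval after each step."""
        ctx = Context()
        for agent in agents:
            ctx = agent.run(ctx)
            if not approve(ctx.log[-1]):
                ctx.log.append(f"workflow halted after {agent.name}")
                break
        return ctx

    if __name__ == "__main__":
        final = run_workflow(
            [DataAgent(), OmicsAgent(), AnalysisAgent()],
            approve=lambda step: input(f"{step}. Continue? [y/n] ").lower().startswith("y"),
        )
        print("\n".join(final.log))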

Dual-LLM Adversarial Framework for Information Extraction from Research Literature

Information Extraction (IE) is a fundamental task in Natural Language Processing (NLP) that aims to automatically identify relevant information from unstructured or semi-structured data. Information extraction from lengthy research literature, particularly in multi-omics studies, faces significant challenges due to the complex narratives and extensive context of such articles. To address this, we present a novel dual-LLM adversarial framework in which one large language model (LLM) performs the extraction and another provides iterative feedback to refine the results. This process systematically reduces errors, enhances consistency across heterogeneous data sources, and converges toward more accurate outputs. We evaluated our approach against manual and single-LLM extraction, using LLMs as evaluators. Experimental results show that our adversarial framework outperforms these baselines, highlighting its effectiveness for extracting structured information from lengthy scientific texts.
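The extract-and-critique loop described above can be pictured with a short Python sketch. The call_llm helper, prompt wording, stopping rule, and extraction schema here are hypothetical stand-ins to illustrate the pattern, not the paper's implementation.

    # Sketch of a dual-LLM extract/critique loop, assuming a generic
    # chat-completion helper `call_llm(prompt) -> str`.
    def call_llm(prompt: str) -> str:
        """Placeholder for any chat-completion client (API or local model)."""
        raise NotImplementedError

    def extract(document: str, feedback: str = "") -> str:
        """Extractor LLM: pull structured fields, optionally guided by reviewer feedback."""
        prompt = (
            "Extract the study design, cohort size, omics platforms, and key findings "
            "as JSON from the text below.\n"
            f"Previous reviewer feedback (may be empty): {feedback}\n\n{document}"
        )
        return call_llm(prompt)

    def critique(document: str, extraction: str) -> str:
        """Reviewer LLM: list errors and omissions, or reply 'OK' if the extraction is acceptable."""
        prompt = (
            "You are a strict reviewer. List factual errors, omissions, or schema "
            "violations in this extraction, or reply 'OK' if there are none.\n\n"
            f"Source:\n{document}\n\nExtraction:\n{extraction}"
        )
        return call_llm(prompt)

    def adversarial_extract(document: str, max_rounds: int = 3) -> str:
        """Iterate extraction and critique until the reviewer accepts or rounds run out."""
        extraction = extract(document)
        for _ in range(max_rounds):
            feedback = critique(document, extraction)
            if feedback.strip().upper() == "OK":
                break
            extraction = extract(document, feedback)
        return extraction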

Experimenting with Two Recent Feature Selection Methods for High-Dimensional Biological Data

Feature selection in high-dimensional biological data, where the number of features far exceeds the number of samples, has long posed a significant methodological challenge. This study evaluates two recently developed feature selection methods, Stabl and Nullstrap, under a simulation framework designed to replicate regression, classification, and non-linear regression tasks across varying feature dimensions and noise levels. Our results demonstrate that Nullstrap consistently outperforms Stabl and other benchmarked methods across all evaluated scenarios. Furthermore, Nullstrap proved significantly faster and more scalable in high-dimensional settings, underscoring its suitability for large-scale omics data applications. These findings establish Nullstrap as a robust, accurate, and computationally efficient feature selection tool for modern omics data analysis.
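The flavor of the simulation framework can be sketched in a few lines of Python: generate data with a known informative support in the p >> n regime, run a selector, and score recovery of the true features. The harness below uses scikit-learn's LassoCV purely as a stand-in selector; Stabl and Nullstrap themselves, and the exact dimensions and noise grids used in the study, are not reproduced here.

    # Sketch of a feature-selection benchmark in the p >> n regime.
    # LassoCV is a stand-in; Stabl/Nullstrap would replace `select_features`.
    import numpy as np
    from sklearn.linear_model import LassoCV

    def simulate(n=100, p=2000, k=10, noise=1.0, seed=0):
        """Linear regression data with k truly informative features out of p."""
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((n, p))
        support = rng.choice(p, size=k, replace=False)
        beta = np.zeros(p)
        beta[support] = rng.uniform(1.0, 3.0, size=k)
        y = X @ beta + noise * rng.standard_normal(n)
        return X, y, set(support)

    def select_features(X, y):
        """Stand-in selector: nonzero coefficients of a cross-validated lasso."""
        model = LassoCV(cv=5).fit(X, y)
        return set(np.flatnonzero(model.coef_))

    def f1(selected, truth):
        """F1 score of the recovered feature set against the true support."""
        tp = len(selected & truth)
        precision = tp / len(selected) if selected else 0.0
        recall = tp / len(truth)
        return 2 * precision * recall / (precision + recall) if tp else 0.0

    if __name__ == "__main__":
        X, y, truth = simulate(noise=2.0)
        print("selection F1:", round(f1(select_features(X, y), truth), 3))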

MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction

Natural language processing (NLP) tasks aim to convert unstructured text data (e.g. articles or dialogues) to structured information. In recent years, we have witnessed fundamental advances in NLP techniques, which have been widely used in many applications such as financial text mining, news recommendation and machine translation. However, their application in the biomedical space remains challenging due to a lack of labeled data and the ambiguities and inconsistencies of biological terminology. In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations of biomedical entities are valuable as they can provide a more thorough survey of all available literature, hence providing a less biased result compared to manual curation. In addition, the speed of machine reading helps to quickly orient research and development.

To address the aforementioned needs, we developed automatic training data labeling, rule-based biological terminology cleaning, and a more accurate NLP model for binary associative and multi-relation prediction, and integrated them into the MarkerGenie program. We demonstrated the effectiveness of the proposed methods in identifying relations between biomedical entities on various benchmark datasets and case studies.
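One of the components named above, automatic training data labeling, is commonly implemented in a distant-supervision style: sentences mentioning a curated entity pair become training examples. The sketch below illustrates that idea together with simple rule-based terminology cleaning; the lexicons, cleaning rules, and co-occurrence labeling are illustrative assumptions, not MarkerGenie's actual implementation.

    # Sketch of distant-supervision-style automatic labeling: sentences that
    # mention a curated (marker, disease) pair become candidate training examples.
    import re
    from typing import Dict, List, Set, Tuple

    KNOWN_ASSOCIATIONS: Set[Tuple[str, str]] = {
        ("fusobacterium nucleatum", "colorectal cancer"),
        ("mir-21", "breast cancer"),
    }
    MARKERS = {m for m, _ in KNOWN_ASSOCIATIONS}
    DISEASES = {d for _, d in KNOWN_ASSOCIATIONS}

    def clean_term(text: str) -> str:
        """Rule-based terminology cleaning: lowercase, strip punctuation and extra spaces."""
        return re.sub(r"\s+", " ", re.sub(r"[^\w\s-]", " ", text)).strip().lower()

    def auto_label(sentences: List[str]) -> List[Dict]:
        """Label a sentence positive if it contains a known marker-disease pair."""
        examples = []
        for sent in sentences:
            text = clean_term(sent)
            for marker in MARKERS:
                for disease in DISEASES:
                    if marker in text and disease in text:
                        label = int((marker, disease) in KNOWN_ASSOCIATIONS)
                        examples.append({"sentence": sent, "marker": marker,
                                         "disease": disease, "label": label})
        return examples

    if __name__ == "__main__":
        corpus = ["Fusobacterium nucleatum is enriched in colorectal cancer tissue."]
        print(auto_label(corpus))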

Tumor Cell Fraction Estimation Based on Tissue Region Segmentation and Nuclear Density

Tumor cell fraction (TCF), or tumor purity, is a critical factor in cancer diagnosis, prognosis, and molecular profiling. While genomic methods provide accurate TCF estimates, they are costly, time-consuming, and lack spatial resolution. Recent whole slide image-based approaches offer a more scalable alternative but often suffer from limited interpretability and inconsistent accuracy. We propose a novel TCF estimation method based on tissue–nuclei density (TNuD), integrating tissue region segmentation and nuclear classification from hematoxylin and eosin (H&E)-stained whole slide images. The method consists of a DeepLabV3+-based tissue region segmentation model and a HoVer-Net-based nuclear segmentation and classification model. The outputs of the two models are fused to construct a TNuD matrix representing the spatial density relationships between tissue and nuclei. We evaluated the proposed method against four TCF estimation baselines using expert-annotated breast and ovarian cancer datasets. The proposed TNuD-based method achieved the lowest mean squared error (MSE = 0.0214) and the highest correlation with pathologist annotations (Pearson = 0.8683; Spearman = 0.8737) on the breast cancer datasets. It also demonstrated promising transferability to ovarian cancer tissues. Comparative analysis showed superior precision and interpretability over region- or nucleus-only models. The TNuD-based method effectively captures tumor heterogeneity by combining macro- and micro-level histological features. It offers a scalable, interpretable, and accurate solution for TCF estimation in digital pathology, supporting broader clinical and translational oncology applications.
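A stripped-down view of the fusion step is sketched below: a tissue mask and per-nucleus classifications are combined into a tile-level density matrix, from which an aggregate TCF is read off. The array shapes, class encodings, and the simple ratio estimator are illustrative assumptions for exposition, not the published TNuD model.

    # Sketch of fusing a tissue mask with nuclear classifications into a
    # tile-level tumor-nuclei density matrix and an aggregate TCF estimate.
    import numpy as np

    def tnud_matrix(tissue_mask, nuclei_centroids, nuclei_labels, tile=256):
        """Count tumor vs. all nuclei per tile, restricted to tissue regions.

        tissue_mask: 2D array, 1 where tissue is present, 0 for background.
        nuclei_centroids: (N, 2) array of (row, col) nucleus centers.
        nuclei_labels: length-N array, 1 for tumor nuclei, 0 otherwise.
        """
        h, w = tissue_mask.shape
        rows, cols = h // tile, w // tile
        tumor = np.zeros((rows, cols))
        total = np.zeros((rows, cols))
        for (r, c), lab in zip(nuclei_centroids.astype(int), nuclei_labels):
            i, j = min(r // tile, rows - 1), min(c // tile, cols - 1)
            if tissue_mask[r, c]:  # keep only nuclei that fall inside tissue
                total[i, j] += 1
                tumor[i, j] += lab
        return tumor, total

    def estimate_tcf(tumor, total):
        """Slide-level TCF as tumor nuclei over all nuclei within tissue tiles."""
        return float(tumor.sum() / max(total.sum(), 1))

    if __name__ == "__main__":
        mask = np.ones((512, 512), dtype=int)
        centroids = np.array([[10, 10], [300, 300], [400, 100]])
        labels = np.array([1, 0, 1])
        t, n = tnud_matrix(mask, centroids, labels)
        print("estimated TCF:", estimate_tcf(t, n))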

Prediction of biomarker–disease associations based on graph attention network and text representation

The associations between biomarkers and human diseases play a key role in understanding complex pathology and developing targeted therapies. Wet lab experiments for biomarker discovery are costly, laborious and time-consuming. Computational prediction methods can be used to greatly expedite the identification of candidate biomarkers.

Here, we present a novel computational model named GTGenie for predicting biomarker–disease associations based on graph and text features. In GTGenie, a graph attention network is utilized to characterize diverse similarities of biomarkers and diseases from heterogeneous information resources. Meanwhile, a pretrained BERT-based model is applied to learn the text-based representation of biomarker–disease relations from biomedical literature. The captured graph and text features are then integrated in a bimodal fusion network to model the hybrid entity representation. Finally, inductive matrix completion is adopted to infer the missing entries of the relation matrix, from which the unknown biomarker–disease associations are predicted. Experimental results on the HMDD, HMDAD and LncRNADisease data sets showed that GTGenie achieves prediction performance competitive with other state-of-the-art methods.
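The bimodal fusion idea can be sketched in PyTorch: a graph-derived embedding and a text-derived embedding for a biomarker–disease pair are projected to a shared space, concatenated, and scored. The layer sizes and the concatenation-plus-MLP head below are illustrative assumptions rather than GTGenie's actual architecture.

    # Sketch of a bimodal fusion head combining graph and text features for a
    # (biomarker, disease) pair. Dimensions and fusion design are illustrative.
    import torch
    import torch.nn as nn

    class BimodalFusion(nn.Module):
        def __init__(self, graph_dim=128, text_dim=768, hidden=256):
            super().__init__()
            self.proj_graph = nn.Linear(graph_dim, hidden)  # e.g., graph attention output per pair
            self.proj_text = nn.Linear(text_dim, hidden)    # e.g., pooled BERT sentence vector
            self.scorer = nn.Sequential(
                nn.Linear(2 * hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, graph_feat, text_feat):
            """Return an association score for each biomarker-disease pair."""
            h = torch.cat([self.proj_graph(graph_feat), self.proj_text(text_feat)], dim=-1)
            return torch.sigmoid(self.scorer(h)).squeeze(-1)

    if __name__ == "__main__":
        model = BimodalFusion()
        g = torch.randn(4, 128)  # graph features for 4 candidate pairs
        t = torch.randn(4, 768)  # text features for the same pairs
        print(model(g, t))       # predicted association probabilities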

The source code of GTGenie and the test data are available at: GTGenie.

A Survey of Deep Learning in Histopathological Nuclear Segmentation

Histopathological images contain rich information that can be used to diagnose and monitor disease progression and to predict patient survival. Accurate morphological segmentation, in particular nuclear segmentation, of these images is a critical task, where many deep learning (DL)-based methods have been widely used. This article provides a thorough review of these methods from three different aspects: nuclear semantic segmentation, nuclear instance segmentation, and joint nuclear segmentation and classification. It also summarizes the process of whole-slide image nuclear segmentation, followed by an extensive compilation of publicly available datasets and evaluation metrics that have been widely used to evaluate DL-based methods. Finally, the challenges and future directions in this field are discussed. This survey is expected to serve as a practical guide for researchers interested in using and improving DL-based models for histopathological nuclear segmentation.

Get started with a free trial

Access the next generation of research for life sciences.