Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeAccelerating the Generation of Molecular Conformations with Progressive Distillation of Equivariant Latent Diffusion Models
Recent advances in fast sampling methods for diffusion models have demonstrated significant potential to accelerate generation on image modalities. We apply these methods to 3-dimensional molecular conformations by building on the recently introduced GeoLDM equivariant latent diffusion model (Xu et al., 2023). We evaluate trade-offs between speed gains and quality loss, as measured by molecular conformation structural stability. We introduce Equivariant Latent Progressive Distillation, a fast sampling algorithm that preserves geometric equivariance and accelerates generation from latent diffusion models. Our experiments demonstrate up to 7.5x gains in sampling speed with limited degradation in molecular stability. These results suggest this accelerated sampling method has strong potential for high-throughput in silico molecular conformations screening in computational biochemistry, drug discovery, and life sciences applications.
QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry
Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this blank, we propose QCBench, a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields, which contains analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of large language models (LLMs), they are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.
AdsorbML: Accelerating Adsorption Energy Calculations with Machine Learning
Computational catalysis is playing an increasingly significant role in the design of catalysts across a wide range of applications. A common task for many computational methods is the need to accurately compute the minimum binding energy - the adsorption energy - for an adsorbate and a catalyst surface of interest. Traditionally, the identification of low energy adsorbate-surface configurations relies on heuristic methods and researcher intuition. As the desire to perform high-throughput screening increases, it becomes challenging to use heuristics and intuition alone. In this paper, we demonstrate machine learning potentials can be leveraged to identify low energy adsorbate-surface configurations more accurately and efficiently. Our algorithm provides a spectrum of trade-offs between accuracy and efficiency, with one balanced option finding the lowest energy configuration, within a 0.1 eV threshold, 86.33% of the time, while achieving a 1331x speedup in computation. To standardize benchmarking, we introduce the Open Catalyst Dense dataset containing nearly 1,000 diverse surfaces and 85,658 unique configurations.
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a novel numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.
Efficient Implementation of Gaussian Process Regression Accelerated Saddle Point Searches with Application to Molecular Reactions
The task of locating first order saddle points on high-dimensional surfaces describing the variation of energy as a function of atomic coordinates is an essential step for identifying the mechanism and estimating the rate of thermally activated events within the harmonic approximation of transition state theory. When combined directly with electronic structure calculations, the number of energy and atomic force evaluations needed for convergence is a primary issue. Here, we describe an efficient implementation of Gaussian process regression (GPR) acceleration of the minimum mode following method where a dimer is used to estimate the lowest eigenmode of the Hessian. A surrogate energy surface is constructed and updated after each electronic structure calculation. The method is applied to a test set of 500 molecular reactions previously generated by Hermez and coworkers [J. Chem. Theory Comput. 18, 6974 (2022)]. An order of magnitude reduction in the number of electronic structure calculations needed to reach the saddle point configurations is obtained by using the GPR compared to the dimer method. Despite the wide range in stiffness of the molecular degrees of freedom, the calculations are carried out using Cartesian coordinates and are found to require similar number of electronic structure calculations as an elaborate internal coordinate method implemented in the Sella software package. The present implementation of the GPR surrogate model in C++ is efficient enough for the wall time of the saddle point searches to be reduced in 3 out of 4 cases even though the calculations are carried out at a low Hartree-Fock level.
A Benchmark for Quantum Chemistry Relaxations via Machine Learning Interatomic Potentials
Computational quantum chemistry plays a critical role in drug discovery, chemical synthesis, and materials science. While first-principles methods, such as density functional theory (DFT), provide high accuracy in modeling electronic structures and predicting molecular properties, they are computationally expensive. Machine learning interatomic potentials (MLIPs) have emerged as promising surrogate models that aim to achieve DFT-level accuracy while enabling efficient large-scale atomistic simulations. The development of accurate and transferable MLIPs requires large-scale, high-quality datasets with both energy and force labels. Critically, MLIPs must generalize not only to stable geometries but also to intermediate, non-equilibrium conformations encountered during atomistic simulations. In this work, we introduce PubChemQCR, a large-scale dataset of molecular relaxation trajectories curated from the raw geometry optimization outputs of the PubChemQC project. PubChemQCR is the largest publicly available dataset of DFT-based relaxation trajectories for small organic molecules, comprising approximately 3.5 million trajectories and over 300 million molecular conformations computed at various levels of theory. Each conformation is labeled with both total energy and atomic forces, making the dataset suitable for training and evaluating MLIPs. To provide baselines for future developments, we benchmark nine representative MLIP models on the dataset. Our resources are publicly available at https://huggingface.co/divelab
nabla^2DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called nabla^2DFT that is based on the nablaDFT. It contains twice as much molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level (omegaB97X-D/def2-SVP) for each conformation. Moreover, nabla^2DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
Efficiently predicting high resolution mass spectra with graph neural networks
Identifying a small molecule from its mass spectrum is the primary open problem in computational metabolomics. This is typically cast as information retrieval: an unknown spectrum is matched against spectra predicted computationally from a large database of chemical structures. However, current approaches to spectrum prediction model the output space in ways that force a tradeoff between capturing high resolution mass information and tractable learning. We resolve this tradeoff by casting spectrum prediction as a mapping from an input molecular graph to a probability distribution over molecular formulas. We discover that a large corpus of mass spectra can be closely approximated using a fixed vocabulary constituting only 2% of all observed formulas. This enables efficient spectrum prediction using an architecture similar to graph classification - GrAFF-MS - achieving significantly lower prediction error and orders-of-magnitude faster runtime than state-of-the-art methods.
ATOM3D: Tasks On Molecules in Three Dimensions
Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. In particular, deep neural networks have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by a lack of either systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves to be critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries, graph networks performing well on systems requiring detailed positional information, and the more recently developed equivariant networks showing significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning, and that there is potential for improvement on many tasks which remain underexplored. To lower the barrier to entry and facilitate further developments in the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source atom3d Python package. All datasets are available for download from https://www.atom3d.ai .
Bayesian Hierarchical Models for Quantitative Estimates for Performance metrics applied to Saddle Search Algorithms
Rigorous performance evaluation is essential for developing robust algorithms for high-throughput computational chemistry. Traditional benchmarking, however, often struggles to account for system-specific variability, making it difficult to form actionable conclusions. We present a Bayesian hierarchical modeling framework that rigorously quantifies performance metrics and their uncertainty, enabling a nuanced comparison of algorithmic strategies. We apply this framework to analyze the Dimer method, comparing Conjugate Gradient (CG) and L-BFGS rotation optimizers, with and without the removal of external rotations, across a benchmark of 500 molecular systems. Our analysis confirms that CG offers higher overall robustness than L-BFGS in this context. While the theoretically-motivated removal of external rotations led to higher computational cost (>40% more energy and force calls) for most systems in this set, our models also reveal a subtle interplay, hinting that this feature may improve the reliability of the L-BFGS optimizer. Rather than identifying a single superior method, our findings support the design of adaptive "chain of methods" workflows. This work showcases how a robust statistical paradigm can move beyond simple performance rankings to inform the intelligent, context-dependent application of computational chemistry methods.
Electron flow matching for generative reaction mechanism prediction obeying conservation laws
Central to our understanding of chemical reactivity is the principle of mass conservation, which is fundamental for ensuring physical consistency, balancing equations, and guiding reaction design. However, data-driven computational models for tasks such as reaction product prediction rarely abide by this most basic constraint. In this work, we recast the problem of reaction prediction as a problem of electron redistribution using the modern deep generative framework of flow matching. Our model, FlowER, overcomes limitations inherent in previous approaches by enforcing exact mass conservation, thereby resolving hallucinatory failure modes, recovering mechanistic reaction sequences for unseen substrate scaffolds, and generalizing effectively to out-of-domain reaction classes with extremely data-efficient fine-tuning. FlowER additionally enables estimation of thermodynamic or kinetic feasibility and manifests a degree of chemical intuition in reaction prediction tasks. This inherently interpretable framework represents a significant step in bridging the gap between predictive accuracy and mechanistic understanding in data-driven reaction outcome prediction.
Knowledge-informed Molecular Learning: A Survey on Paradigm Transfer
Machine learning, notably deep learning, has significantly propelled molecular investigations within the biochemical sphere. Traditionally, modeling for such research has centered around a handful of paradigms. For instance, the prediction paradigm is frequently deployed for tasks such as molecular property prediction. To enhance the generation and decipherability of purely data-driven models, scholars have integrated biochemical domain knowledge into these molecular study models. This integration has sparked a surge in paradigm transfer, which is solving one molecular learning task by reformulating it as another one. With the emergence of Large Language Models, these paradigms have demonstrated an escalating trend towards harmonized unification. In this work, we delineate a literature survey focused on knowledge-informed molecular learning from the perspective of paradigm transfer. We classify the paradigms, scrutinize their methodologies, and dissect the contribution of domain knowledge. Moreover, we encapsulate prevailing trends and identify intriguing avenues for future exploration in molecular learning.
QH9: A Quantum Hamiltonian Prediction Benchmark for QM9 Molecules
Supervised machine learning approaches have been increasingly used in accelerating electronic structure prediction as surrogates of first-principle computational methods, such as density functional theory (DFT). While numerous quantum chemistry datasets focus on chemical properties and atomic forces, the ability to achieve accurate and efficient prediction of the Hamiltonian matrix is highly desired, as it is the most important and fundamental physical quantity that determines the quantum states of physical systems and chemical properties. In this work, we generate a new Quantum Hamiltonian dataset, named as QH9, to provide precise Hamiltonian matrices for 999 or 2998 molecular dynamics trajectories and 130,831 stable molecular geometries, based on the QM9 dataset. By designing benchmark tasks with various molecules, we show that current machine learning models have the capacity to predict Hamiltonian matrices for arbitrary molecules. Both the QH9 dataset and the baseline models are provided to the community through an open-source benchmark, which can be highly valuable for developing machine learning methods and accelerating molecular and materials design for scientific and technological applications. Our benchmark is publicly available at https://github.com/divelab/AIRS/tree/main/OpenDFT/QHBench.
The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models
Machine learning (ML) models hold the promise of transforming atomic simulations by delivering quantum chemical accuracy at a fraction of the computational cost. Realization of this potential would enable high-throughout, high-accuracy molecular screening campaigns to explore vast regions of chemical space and facilitate ab initio simulations at sizes and time scales that were previously inaccessible. However, a fundamental challenge to creating ML models that perform well across molecular chemistry is the lack of comprehensive data for training. Despite substantial efforts in data generation, no large-scale molecular dataset exists that combines broad chemical diversity with a high level of accuracy. To address this gap, Meta FAIR introduces Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory (DFT) calculations at the omegaB97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute. OMol25 uniquely blends elemental, chemical, and structural diversity including: 83 elements, a wide-range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures. There are ~83M unique molecular systems in OMol25 covering small molecules, biomolecules, metal complexes, and electrolytes, including structures obtained from existing datasets. OMol25 also greatly expands on the size of systems typically included in DFT datasets, with systems of up to 350 atoms. In addition to the public release of the data, we provide baseline models and a comprehensive set of model evaluations to encourage community engagement in developing the next-generation ML models for molecular chemistry.
ProtSolM: Protein Solubility Prediction with Multi-modal Features
Understanding protein solubility is essential for their functional applications. Computational methods for predicting protein solubility are crucial for reducing experimental costs and enhancing the efficiency and success rates of protein engineering. Existing methods either construct a supervised learning scheme on small-scale datasets with manually processed physicochemical properties, or blindly apply pre-trained protein language models to extract amino acid interaction information. The scale and quality of available training datasets leave significant room for improvement in terms of accuracy and generalization. To address these research gaps, we propose \sol, a novel deep learning method that combines pre-training and fine-tuning schemes for protein solubility prediction. ProtSolM integrates information from multiple dimensions, including physicochemical properties, amino acid sequences, and protein backbone structures. Our model is trained using \data, the largest solubility dataset that we have constructed. PDBSol includes over 60,000 protein sequences and structures. We provide a comprehensive leaderboard of existing statistical learning and deep learning methods on independent datasets with computational and experimental labels. ProtSolM achieved state-of-the-art performance across various evaluation metrics, demonstrating its potential to significantly advance the accuracy of protein solubility prediction.
The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.
BoostMD: Accelerating molecular sampling by leveraging ML force field features from previous time-steps
Simulating atomic-scale processes, such as protein dynamics and catalytic reactions, is crucial for advancements in biology, chemistry, and materials science. Machine learning force fields (MLFFs) have emerged as powerful tools that achieve near quantum mechanical accuracy, with promising generalization capabilities. However, their practical use is often limited by long inference times compared to classical force fields, especially when running extensive molecular dynamics (MD) simulations required for many biological applications. In this study, we introduce BoostMD, a surrogate model architecture designed to accelerate MD simulations. BoostMD leverages node features computed at previous time steps to predict energies and forces based on positional changes. This approach reduces the complexity of the learning task, allowing BoostMD to be both smaller and significantly faster than conventional MLFFs. During simulations, the computationally intensive reference MLFF is evaluated only every N steps, while the lightweight BoostMD model handles the intermediate steps at a fraction of the computational cost. Our experiments demonstrate that BoostMD achieves an eight-fold speedup compared to the reference model and generalizes to unseen dipeptides. Furthermore, we find that BoostMD accurately samples the ground-truth Boltzmann distribution when running molecular dynamics. By combining efficient feature reuse with a streamlined architecture, BoostMD offers a robust solution for conducting large-scale, long-timescale molecular simulations, making high-accuracy ML-driven modeling more accessible and practical.
InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions
Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology, and enzyme engineering. Recent breakthroughs in Artificial Intelligence (AI) have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between AI's computational power and researchers' intuition, using natural language to align molecular complexity with human intentions. Large Language Models (LLMs) have shown potential to interpret human intentions, yet their application to biomolecular research remains nascent due to challenges including specialized knowledge requirements, multimodal data integration, and semantic alignment between natural language and biomolecules. To address these limitations, we present InstructBioMol, a novel LLM designed to bridge natural language and biomolecules through a comprehensive any-to-any alignment of natural language, molecules, and proteins. This model can integrate multimodal biomolecules as input, and enable researchers to articulate design goals in natural language, providing biomolecular outputs that meet precise biological needs. Experimental results demonstrate InstructBioMol can understand and design biomolecules following human instructions. Notably, it can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an ESP Score of 70.4, making it the only method to surpass the enzyme-substrate interaction threshold of 60.0 recommended by the ESP developer. This highlights its potential to transform real-world biomolecular research.
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations
While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.
CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes
Enzymes are important proteins that catalyze chemical reactions. In recent years, machine learning methods have emerged to predict enzyme function from sequence; however, there are no standardized benchmarks to evaluate these methods. We introduce CARE, a benchmark and dataset suite for the Classification And Retrieval of Enzymes (CARE). CARE centers on two tasks: (1) classification of a protein sequence by its enzyme commission (EC) number and (2) retrieval of an EC number given a chemical reaction. For each task, we design train-test splits to evaluate different kinds of out-of-distribution generalization that are relevant to real use cases. For the classification task, we provide baselines for state-of-the-art methods. Because the retrieval task has not been previously formalized, we propose a method called Contrastive Reaction-EnzymE Pretraining (CREEP) as one of the first baselines for this task and compare it to the recent method, CLIPZyme. CARE is available at https://github.com/jsunn-y/CARE/.
NMR-Solver: Automated Structure Elucidation via Large-Scale Spectral Matching and Physics-Guided Fragment Optimization
Nuclear Magnetic Resonance (NMR) spectroscopy is one of the most powerful and widely used tools for molecular structure elucidation in organic chemistry. However, the interpretation of NMR spectra to determine unknown molecular structures remains a labor-intensive and expertise-dependent process, particularly for complex or novel compounds. Although recent methods have been proposed for molecular structure elucidation, they often underperform in real-world applications due to inherent algorithmic limitations and limited high-quality data. Here, we present NMR-Solver, a practical and interpretable framework for the automated determination of small organic molecule structures from ^1H and ^{13}C NMR spectra. Our method introduces an automated framework for molecular structure elucidation, integrating large-scale spectral matching with physics-guided fragment-based optimization that exploits atomic-level structure-spectrum relationships in NMR. We evaluate NMR-Solver on simulated benchmarks, curated experimental data from the literature, and real-world experiments, demonstrating its strong generalization, robustness, and practical utility in challenging, real-life scenarios. NMR-Solver unifies computational NMR analysis, deep learning, and interpretable chemical reasoning into a coherent system. By incorporating the physical principles of NMR into molecular optimization, it enables scalable, automated, and chemically meaningful molecular identification, establishing a generalizable paradigm for solving inverse problems in molecular science.
Predicting ATP binding sites in protein sequences using Deep Learning and Natural Language Processing
Predicting ATP-Protein Binding sites in genes is of great significance in the field of Biology and Medicine. The majority of research in this field has been conducted through time- and resource-intensive 'wet experiments' in laboratories. Over the years, researchers have been investigating computational methods computational methods to accomplish the same goals, utilising the strength of advanced Deep Learning and NLP algorithms. In this paper, we propose to develop methods to classify ATP-Protein binding sites. We conducted various experiments mainly using PSSMs and several word embeddings as features. We used 2D CNNs and LightGBM classifiers as our chief Deep Learning Algorithms. The MP3Vec and BERT models have also been subjected to testing in our study. The outcomes of our experiments demonstrated improvement over the state-of-the-art benchmarks.
Deep Learning Methods for Small Molecule Drug Discovery: A Survey
With the development of computer-assisted techniques, research communities including biochemistry and deep learning have been devoted into the drug discovery field for over a decade. Various applications of deep learning have drawn great attention in drug discovery, such as molecule generation, molecular property prediction, retrosynthesis prediction, and reaction prediction. While most existing surveys only focus on one of the applications, limiting the view of researchers in the community. In this paper, we present a comprehensive review on the aforementioned four aspects, and discuss the relationships among different applications. The latest literature and classical benchmarks are presented for better understanding the development of variety of approaches. We commence by summarizing the molecule representation format in these works, followed by an introduction of recent proposed approaches for each of the four tasks. Furthermore, we review a variety of commonly used datasets and evaluation metrics and compare the performance of deep learning-based models. Finally, we conclude by identifying remaining challenges and discussing the future trend for deep learning methods in drug discovery.
Machine-learned molecular mechanics force field for the simulation of protein-ligand systems and beyond
The development of reliable and extensible molecular mechanics (MM) force fields -- fast, empirical models characterizing the potential energy surface of molecular systems -- is indispensable for biomolecular simulation and computer-aided drug design. Here, we introduce a generalized and extensible machine-learned MM force field, espaloma-0.3, and an end-to-end differentiable framework using graph neural networks to overcome the limitations of traditional rule-based methods. Trained in a single GPU-day to fit a large and diverse quantum chemical dataset of over 1.1M energy and force calculations, espaloma-0.3 reproduces quantum chemical energetic properties of chemical domains highly relevant to drug discovery, including small molecules, peptides, and nucleic acids. Moreover, this force field maintains the quantum chemical energy-minimized geometries of small molecules and preserves the condensed phase properties of peptides, self-consistently parametrizing proteins and ligands to produce stable simulations leading to highly accurate predictions of binding free energies. This methodology demonstrates significant promise as a path forward for systematically building more accurate force fields that are easily extensible to new chemical domains of interest.
Amortized Sampling with Transferable Normalizing Flows
Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in-full for each system of interest. The widespread success of generative models has inspired interest into overcoming this limitation through learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We prove that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 280 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve superior performance to established methods such as sequential Monte Carlo on unseen tetrapeptides. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.
Omics-scale polymer computational database transferable to real-world artificial intelligence applications
Developing large-scale foundational datasets is a critical milestone in advancing artificial intelligence (AI)-driven scientific innovation. However, unlike AI-mature fields such as natural language processing, materials science, particularly polymer research, has significantly lagged in developing extensive open datasets. This lag is primarily due to the high costs of polymer synthesis and property measurements, along with the vastness and complexity of the chemical space. This study presents PolyOmics, an omics-scale computational database generated through fully automated molecular dynamics simulation pipelines that provide diverse physical properties for over 10^5 polymeric materials. The PolyOmics database is collaboratively developed by approximately 260 researchers from 48 institutions to bridge the gap between academia and industry. Machine learning models pretrained on PolyOmics can be efficiently fine-tuned for a wide range of real-world downstream tasks, even when only limited experimental data are available. Notably, the generalisation capability of these simulation-to-real transfer models improve significantly as the size of the PolyOmics database increases, exhibiting power-law scaling. The emergence of scaling laws supports the "more is better" principle, highlighting the significance of ultralarge-scale computational materials data for improving real-world prediction performance. This unprecedented omics-scale database reveals vast unexplored regions of polymer materials, providing a foundation for AI-driven polymer science.
Chemical classification program synthesis using generative artificial intelligence
Accurately classifying chemical structures is essential for cheminformatics and bioinformatics, including tasks such as identifying bioactive compounds of interest, screening molecules for toxicity to humans, finding non-organic compounds with desirable material properties, or organizing large chemical libraries for drug discovery or environmental monitoring. However, manual classification is labor-intensive and difficult to scale to large chemical databases. Existing automated approaches either rely on manually constructed classification rules, or the use of deep learning methods that lack explainability. This work presents an approach that uses generative artificial intelligence to automatically write chemical classifier programs for classes in the Chemical Entities of Biological Interest (ChEBI) database. These programs can be used for efficient deterministic run-time classification of SMILES structures, with natural language explanations. The programs themselves constitute an explainable computable ontological model of chemical class nomenclature, which we call the ChEBI Chemical Class Program Ontology (C3PO). We validated our approach against the ChEBI database, and compared our results against state of the art deep learning models. We also demonstrate the use of C3PO to classify out-of-distribution examples taken from metabolomics repositories and natural product databases. We also demonstrate the potential use of our approach to find systematic classification errors in existing chemical databases, and show how an ensemble artificial intelligence approach combining generated ontologies, automated literature search, and multimodal vision models can be used to pinpoint potential errors requiring expert validation
Shoot from the HIP: Hessian Interatomic Potentials without derivatives
Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree l=2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at https://github.com/BurgerAndreas/hip
Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning
Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.
Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design
The efficient exploration of chemical space to design molecules with intended properties enables the accelerated discovery of drugs, materials, and catalysts, and is one of the most important outstanding challenges in chemistry. Encouraged by the recent surge in computer power and artificial intelligence development, many algorithms have been developed to tackle this problem. However, despite the emergence of many new approaches in recent years, comparatively little progress has been made in developing realistic benchmarks that reflect the complexity of molecular design for real-world applications. In this work, we develop a set of practical benchmark tasks relying on physical simulation of molecular systems mimicking real-life molecular design problems for materials, drugs, and chemical reactions. Additionally, we demonstrate the utility and ease of use of our new benchmark set by demonstrating how to compare the performance of several well-established families of algorithms. Surprisingly, we find that model performance can strongly depend on the benchmark domain. We believe that our benchmark suite will help move the field towards more realistic molecular design benchmarks, and move the development of inverse molecular design algorithms closer to designing molecules that solve existing problems in both academia and industry alike.
MoleculeNet: A Benchmark for Molecular Machine Learning
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
De novo design of high-affinity protein binders with AlphaProteo
Computational design of protein-binding proteins is a fundamental capability with broad utility in biomedical research and biotechnology. Recent methods have made strides against some target proteins, but on-demand creation of high-affinity binders without multiple rounds of experimental testing remains an unsolved challenge. This technical report introduces AlphaProteo, a family of machine learning models for protein design, and details its performance on the de novo binder design problem. With AlphaProteo, we achieve 3- to 300-fold better binding affinities and higher experimental success rates than the best existing methods on seven target proteins. Our results suggest that AlphaProteo can generate binders "ready-to-use" for many research applications using only one round of medium-throughput screening and no further optimization.
Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model
The recent success of large foundation models in artificial intelligence has prompted the emergence of chemical pre-trained models. Despite the growing interest in large molecular pre-trained models that provide informative representations for downstream tasks, attempts for multimodal pre-training approaches on the molecule domain were limited. To address this, we present a novel multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties, drawing inspiration from recent advances in multimodal learning techniques. Our proposed model pipeline of data handling and training objectives aligns the structure/property features in a common embedding space, which enables the model to regard bidirectional information between the molecules' structure and properties. These contributions emerge synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks through a single model. Through extensive experiments, we demonstrate that our model shows remarkable capabilities in solving various meaningful chemical challenges, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.
Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs
Molecular representation is a foundational element in our understanding of the physical world. Its importance ranges from the fundamentals of chemical reactions to the design of new therapies and materials. Previous molecular machine learning models have employed strings, fingerprints, global features, and simple molecular graphs that are inherently information-sparse representations. However, as the complexity of prediction tasks increases, the molecular representation needs to encode higher fidelity information. This work introduces a novel approach to infusing quantum-chemical-rich information into molecular graphs via stereoelectronic effects. We show that the explicit addition of stereoelectronic interactions significantly improves the performance of molecular machine learning models. Furthermore, stereoelectronics-infused representations can be learned and deployed with a tailored double graph neural network workflow, enabling its application to any downstream molecular machine learning task. Finally, we show that the learned representations allow for facile stereoelectronic evaluation of previously intractable systems, such as entire proteins, opening new avenues of molecular design.
Polyatomic Complexes: A topologically-informed learning representation for atomistic systems
Developing robust representations of chemical structures that enable models to learn topological inductive biases is challenging. In this manuscript, we present a representation of atomistic systems. We begin by proving that our representation satisfies all structural, geometric, efficiency, and generalizability constraints. Afterward, we provide a general algorithm to encode any atomistic system. Finally, we report performance comparable to state-of-the-art methods on numerous tasks. We open-source all code and datasets. The code and data are available at https://github.com/rahulkhorana/PolyatomicComplexes.
A Vector-Based Algorithm for Generating Complete Balanced Reaction Sets with Arbitrary Numbers of Reagents
We present a vector-based method to balance chemical reactions. The algorithm builds candidates in a deterministic way, removes duplicates, and always prints coefficients in the lowest whole-number form. For redox cases, electrons and protons/hydroxide are treated explicitly, so both mass and charge are balanced. We also outline the basic principles of the vector formulation of stoichiometry, interpreting reactions as integer vectors in composition space, this geometric view supports compact visualizations of reagent-product interactions and helps surface distinct reaction families. The method enumerates valid balances for arbitrary user-specified species lists without special-case balancing rules or symbolic tricks, and it provides a clean foundation for developing new algorithmic variants (e.g., alternative objectives or constraints). On representative examples (neutralization, double displacement, decomposition, classical redox, small multicomponent sets) and a negative control, the method produced correct integer balances. When multiple balances exist, we report a canonical one - minimizing the total coefficient sum with a simple tie-breaker - without claiming global optimality beyond the solutions the search enumerates. The procedure applies per reaction and extends to reaction networks via consistent per-reaction application. We do not report runtimes, broader benchmarking and code/data release are planned.
What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks
Large Language Models (LLMs) with strong abilities in natural language processing tasks have emerged and have been rapidly applied in various kinds of areas such as science, finance and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper,we establish a comprehensive benchmark containing 8 practical chemistry tasks, including 1) name prediction, 2) property prediction, 3) yield prediction, 4) reaction prediction, 5) retrosynthesis (prediction of reactants from products), 6)text-based molecule design, 7) molecule captioning, and 8) reagent selection. Our analysis draws on widely recognized datasets including BBBP, Tox21, PubChem, USPTO, and ChEBI, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Three GPT models (GPT-4, GPT-3.5,and Davinci-003) are evaluated for each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. The key results of our investigation are 1) GPT-4 outperforms the other two models among the three evaluated; 2) GPT models exhibit less competitive performance in tasks demanding precise understanding of molecular SMILES representation, such as reaction prediction and retrosynthesis;3) GPT models demonstrate strong capabilities in text-related explanation tasks such as molecule captioning; and 4) GPT models exhibit comparable or better performance to classical machine learning models when applied to chemical problems that can be transformed into classification or ranking tasks, such as property prediction, and yield prediction.
Protein-ligand binding representation learning from fine-grained interactions
The binding between proteins and ligands plays a crucial role in the realm of drug discovery. Previous deep learning approaches have shown promising results over traditional computationally intensive methods, but resulting in poor generalization due to limited supervised data. In this paper, we propose to learn protein-ligand binding representation in a self-supervised learning manner. Different from existing pre-training approaches which treat proteins and ligands individually, we emphasize to discern the intricate binding patterns from fine-grained interactions. Specifically, this self-supervised learning problem is formulated as a prediction of the conclusive binding complex structure given a pocket and ligand with a Transformer based interaction module, which naturally emulates the binding process. To ensure the representation of rich binding information, we introduce two pre-training tasks, i.e.~atomic pairwise distance map prediction and mask ligand reconstruction, which comprehensively model the fine-grained interactions from both structure and feature space. Extensive experiments have demonstrated the superiority of our method across various binding tasks, including protein-ligand affinity prediction, virtual screening and protein-ligand docking.
2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision
Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data. However, progress has been limited by the lack of large-scale and high-quality annotated datasets. In this work, we introduce 2DNMRGym, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES strings. Uniquely, 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations derived from a previously validated method and evaluated on a held-out set of human-annotated gold-standard labels. This enables rigorous assessment of a model's ability to generalize from imperfect supervision to expert-level interpretation. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work. 2DNMRGym supports scalable model training and introduces a chemically meaningful benchmark for evaluating atom-level molecular representations in NMR-guided structural tasks. Our data and code is open-source and available on Huggingface and Github.
Generative Pretrained Autoregressive Transformer Graph Neural Network applied to the Analysis and Discovery of Novel Proteins
We report a flexible language-model based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling, based on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism, to realize a generative pretrained model. The model is applied to predict secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model is rendered capable of designing proteins with these properties as target features. The model is formulated as a general framework, completely prompt-based, and can be adapted for a variety of downstream tasks. We find that adding additional tasks yields emergent synergies that the model exploits in improving overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs specifically focused on structural proteins, but also exploring the applicability in the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform 8 distinct tasks, with available datasets it can be extended to solve additional problems. In a broader sense, this work illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level utf8 characters) to complex output. This materiomic scheme captures complex emergent relationships between universal building block and resulting properties via a synergizing learning capacity to express a set of potentialities embedded in the knowledge used in training, via the interplay of universality and diversity.
Enhanced Sampling, Public Dataset and Generative Model for Drug-Protein Dissociation Dynamics
Drug-protein binding and dissociation dynamics are fundamental to understanding molecular interactions in biological systems. While many tools for drug-protein interaction studies have emerged, especially artificial intelligence (AI)-based generative models, predictive tools on binding/dissociation kinetics and dynamics are still limited. We propose a novel research paradigm that combines molecular dynamics (MD) simulations, enhanced sampling, and AI generative models to address this issue. We propose an enhanced sampling strategy to efficiently implement the drug-protein dissociation process in MD simulations and estimate the free energy surface (FES). We constructed a program pipeline of MD simulations based on this sampling strategy, thus generating a dataset including 26,612 drug-protein dissociation trajectories containing about 13 million frames. We named this dissociation dynamics dataset DD-13M and used it to train a deep equivariant generative model UnbindingFlow, which can generate collision-free dissociation trajectories. The DD-13M database and UnbindingFlow model represent a significant advancement in computational structural biology, and we anticipate its broad applicability in machine learning studies of drug-protein interactions. Our ongoing efforts focus on expanding this methodology to encompass a broader spectrum of drug-protein complexes and exploring novel applications in pathway prediction.
Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks
Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models.
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive instruction dataset expressly designed for the biomolecular realm. Mol-Instructions is composed of three pivotal components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions, each curated to enhance the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on the representative LLM, we underscore the potency of Mol-Instructions to enhance the adaptability and cognitive acuity of large models within the complex sphere of biomolecular studies, thereby promoting advancements in the biomolecular research community. Mol-Instructions is made publicly accessible for future research endeavors and will be subjected to continual updates for enhanced applicability.
Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding
Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. Recent development of protein language models (pLMs) with supervised fine tuning provides a promising solution to this problem. However, the fine-tuned model is tailored for particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach integrates a noval structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins. In this framework, we propose a novel two-stage instruction tuning pipeline that first establishes a basic understanding of proteins through caption-based instructions and then refines this understanding using a mixture of experts (MoEs) to learn more complex properties and functional information with the same amount of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model. Extensive experimental results on open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.
A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
Peptide Sequencing Via Protein Language Models
We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing relies on mass spectrometry, with some novel edman degregation based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and finetune a ProtBert derived transformer-based model, for a new downstream task predicting these masked residues, providing an approximation of the complete sequence. Evaluating on three bacterial Escherichia species, we achieve per-amino-acid accuracy up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessment using AlphaFold and TM-score validates the biological relevance of our predictions. The model also demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, potentially accelerating advancements in proteomics and structural biology by providing a probabilistic reconstruction of the complete protein sequence from limited experimental data.
The Open Catalyst 2020 (OC20) Dataset and Community Challenges
Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, Dimenet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leader board to encourage community contributions to solve these important tasks.
Symmetry-invariant quantum machine learning force fields
Machine learning techniques are essential tools to compute efficient, yet accurate, force fields for atomistic simulations. This approach has recently been extended to incorporate quantum computational methods, making use of variational quantum learning models to predict potential energy surfaces and atomic forces from ab initio training data. However, the trainability and scalability of such models are still limited, due to both theoretical and practical barriers. Inspired by recent developments in geometric classical and quantum machine learning, here we design quantum neural networks that explicitly incorporate, as a data-inspired prior, an extensive set of physically relevant symmetries. We find that our invariant quantum learning models outperform their more generic counterparts on individual molecules of growing complexity. Furthermore, we study a water dimer as a minimal example of a system with multiple components, showcasing the versatility of our proposed approach and opening the way towards larger simulations. Our results suggest that molecular force fields generation can significantly profit from leveraging the framework of geometric quantum machine learning, and that chemical systems represent, in fact, an interesting and rich playground for the development and application of advanced quantum machine learning tools.
Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker--Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker--Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
Persistent-Homology-based Machine Learning and its Applications -- A Survey
A suitable feature representation that can both preserve the data intrinsic information and reduce data complexity and dimensionality is key to the performance of machine learning models. Deeply rooted in algebraic topology, persistent homology (PH) provides a delicate balance between data simplification and intrinsic structure characterization, and has been applied to various areas successfully. However, the combination of PH and machine learning has been hindered greatly by three challenges, namely topological representation of data, PH-based distance measurements or metrics, and PH-based feature representation. With the development of topological data analysis, progresses have been made on all these three problems, but widely scattered in different literatures. In this paper, we provide a systematical review of PH and PH-based supervised and unsupervised models from a computational perspective. Our emphasizes are the recent development of mathematical models and tools, including PH softwares and PH-based functions, feature representations, kernels, and similarity models. Essentially, this paper can work as a roadmap for the practical application of PH-based machine learning tools. Further, we consider different topological feature representations in different machine learning models, and investigate their impacts on the protein secondary structure classification.
ChemCrow: Augmenting large-language models with chemistry tools
Over the last decades, excellent computational chemistry tools have been developed. Their full potential has not yet been reached as most are challenging to learn and exist in isolation. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 17 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned the syntheses of an insect repellent, three organocatalysts, as well as other relevant molecules. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and Chemcrow's performance. There is a significant risk of misuse of tools like ChemCrow, and we discuss their potential harms. Employed responsibly, our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry. A subset of the code is publicly available at https://github.com/ur-whitelab/chemcrow-public.
oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
SELFormer: Molecular Representation Learning via SELFIES Language Models
Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. Majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that, SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notations in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
Omni-Mol: Exploring Universal Convergent Space for Omni-Molecular Tasks
Building generalist models has recently demonstrated remarkable capabilities in diverse scientific domains. Within the realm of molecular learning, several studies have explored unifying diverse tasks across diverse domains. However, negative conflicts and interference between molecules and knowledge from different domain may have a worse impact in threefold. First, conflicting molecular representations can lead to optimization difficulties for the models. Second, mixing and scaling up training data across diverse tasks is inherently challenging. Third, the computational cost of refined pretraining is prohibitively high. To address these limitations, this paper presents Omni-Mol, a scalable and unified LLM-based framework for direct instruction tuning. Omni-Mol builds on three key components to tackles conflicts: (1) a unified encoding mechanism for any task input; (2) an active-learning-driven data selection strategy that significantly reduces dataset size; (3) a novel design of the adaptive gradient stabilization module and anchor-and-reconcile MoE framework that ensures stable convergence. Experimentally, Omni-Mol achieves state-of-the-art performance across 15 molecular tasks, demonstrates the presence of scaling laws in the molecular domain, and is supported by extensive ablation studies and analyses validating the effectiveness of its design. The code and weights of the powerful AI-driven chemistry generalist are open-sourced at: https://anonymous.4open.science/r/Omni-Mol-8EDB.
Accurate and scalable exchange-correlation with deep learning
Density Functional Theory (DFT) is the most widely used electronic structure method for predicting the properties of molecules and materials. Although DFT is, in principle, an exact reformulation of the Schr\"odinger equation, practical applications rely on approximations to the unknown exchange-correlation (XC) functional. Most existing XC functionals are constructed using a limited set of increasingly complex, hand-crafted features that improve accuracy at the expense of computational efficiency. Yet, no current approximation achieves the accuracy and generality for predictive modeling of laboratory experiments at chemical accuracy -- typically defined as errors below 1 kcal/mol. In this work, we present Skala, a modern deep learning-based XC functional that bypasses expensive hand-designed features by learning representations directly from data. Skala achieves chemical accuracy for atomization energies of small molecules while retaining the computational efficiency typical of semi-local DFT. This performance is enabled by training on an unprecedented volume of high-accuracy reference data generated using computationally intensive wavefunction-based methods. Notably, Skala systematically improves with additional training data covering diverse chemistry. By incorporating a modest amount of additional high-accuracy data tailored to chemistry beyond atomization energies, Skala achieves accuracy competitive with the best-performing hybrid functionals across general main group chemistry, at the cost of semi-local DFT. As the training dataset continues to expand, Skala is poised to further enhance the predictive power of first-principles simulations.
Tx-LLM: A Large Language Model for Therapeutics
Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities(small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free-text, allowing it to predict a broad range of associated properties, achieving competitive with state-of-the-art (SOTA) performance on 43 out of 66 tasks and exceeding SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g.,tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery development pipeline.
QuantumBind-RBFE: Accurate Relative Binding Free Energy Calculations Using Neural Network Potentials
Accurate prediction of protein-ligand binding affinities is crucial in drug discovery, particularly during hit-to-lead and lead optimization phases, however, limitations in ligand force fields continue to impact prediction accuracy. In this work, we validate relative binding free energy (RBFE) accuracy using neural network potentials (NNPs) for the ligands. We utilize a novel NNP model, AceForce 1.0, based on the TensorNet architecture for small molecules that broadens the applicability to diverse drug-like compounds, including all important chemical elements and supporting charged molecules. Using established benchmarks, we show overall improved accuracy and correlation in binding affinity predictions compared with GAFF2 for molecular mechanics and ANI2-x for NNPs. Slightly less accuracy but comparable correlations with OPLS4. We also show that we can run the NNP simulations at 2 fs timestep, at least two times larger than previous NNP models, providing significant speed gains. The results show promise for further evolutions of free energy calculations using NNPs while demonstrating its practical use already with the current generation. The code and NNP model are publicly available for research use.
Generative Modeling of Molecular Dynamics Trajectories
Molecular dynamics (MD) is a powerful technique for studying microscopic phenomena, but its computational cost has driven significant interest in the development of deep learning-based surrogate models. We introduce generative modeling of molecular trajectories as a paradigm for learning flexible multi-task surrogate models of MD from data. By conditioning on appropriately chosen frames of the trajectory, we show such generative models can be adapted to diverse tasks such as forward simulation, transition path sampling, and trajectory upsampling. By alternatively conditioning on part of the molecular system and inpainting the rest, we also demonstrate the first steps towards dynamics-conditioned molecular design. We validate the full set of these capabilities on tetrapeptide simulations and show that our model can produce reasonable ensembles of protein monomers. Altogether, our work illustrates how generative modeling can unlock value from MD data towards diverse downstream tasks that are not straightforward to address with existing methods or even MD itself. Code is available at https://github.com/bjing2016/mdgen.
Junction Tree Variational Autoencoder for Molecular Graph Generation
We seek to automate the design of molecules based on specific chemical properties. In computational terms, this task involves continuous embedding and generation of molecular graphs. Our primary contribution is the direct realization of molecular graphs, a task previously approached by generating linear SMILES strings instead of graphs. Our junction tree variational autoencoder generates molecular graphs in two phases, by first generating a tree-structured scaffold over chemical substructures, and then combining them into a molecule with a graph message passing network. This approach allows us to incrementally expand molecules while maintaining chemical validity at every step. We evaluate our model on multiple tasks ranging from molecular generation to optimization. Across these tasks, our model outperforms previous state-of-the-art baselines by a significant margin.
Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches
Gaussian process (GP) regression provides a strategy for accelerating saddle point searches on high-dimensional energy surfaces by reducing the number of times the energy and its derivatives with respect to atomic coordinates need to be evaluated. The computational overhead in the hyperparameter optimization can, however, be large and make the approach inefficient. Failures can also occur if the search ventures too far into regions that are not represented well enough by the GP model. Here, these challenges are resolved by using geometry-aware optimal transport measures and an active pruning strategy using a summation over Wasserstein-1 distances for each atom-type in farthest-point sampling, selecting a fixed-size subset of geometrically diverse configurations to avoid rapidly increasing cost of GP updates as more observations are made. Stability is enhanced by permutation-invariant metric that provides a reliable trust radius for early-stopping and a logarithmic barrier penalty for the growth of the signal variance. These physically motivated algorithmic changes prove their efficacy by reducing to less than a half the mean computational time on a set of 238 challenging configurations from a previously published data set of chemical reactions. With these improvements, the GP approach is established as, a robust and scalable algorithm for accelerating saddle point searches when the evaluation of the energy and atomic forces requires significant computational effort.
Hybrid Quantum Generative Adversarial Networks for Molecular Simulation and Drug Discovery
In molecular research, simulation \& design of molecules are key areas with significant implications for drug development, material science, and other fields. Current classical computational power falls inadequate to simulate any more than small molecules, let alone protein chains on hundreds of peptide. Therefore these experiment are done physically in wet-lab, but it takes a lot of time \& not possible to examine every molecule due to the size of the search area, tens of billions of dollars are spent every year in these research experiments. Molecule simulation \& design has lately advanced significantly by machine learning models, A fresh perspective on the issue of chemical synthesis is provided by deep generative models for graph-structured data. By optimising differentiable models that produce molecular graphs directly, it is feasible to avoid costly search techniques in the discrete and huge space of chemical structures. But these models also suffer from computational limitations when dimensions become huge and consume huge amount of resources. Quantum Generative machine learning in recent years have shown some empirical results promising significant advantages over classical counterparts.
P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching
Biological processes, functions, and properties are intricately linked to the ensemble of protein conformations, rather than being solely determined by a single stable conformation. In this study, we have developed P2DFlow, a generative model based on SE(3) flow matching, to predict the structural ensembles of proteins. We specifically designed a valuable prior for the flow process and enhanced the model's ability to distinguish each intermediate state by incorporating an additional dimension to describe the ensemble data, which can reflect the physical laws governing the distribution of ensembles, so that the prior knowledge can effectively guide the generation process. When trained and evaluated on the MD datasets of ATLAS, P2DFlow outperforms other baseline models on extensive experiments, successfully capturing the observable dynamic fluctuations as evidenced in crystal structure and MD simulations. As a potential proxy agent for protein molecular simulation, the high-quality ensembles generated by P2DFlow could significantly aid in understanding protein functions across various scenarios. Code is available at https://github.com/BLEACH366/P2DFlow
Chem-R: Learning to Reason as a Chemist
Although large language models (LLMs) have significant potential to advance chemical discovery, current LLMs lack core chemical knowledge, produce unreliable reasoning trajectories, and exhibit suboptimal performance across diverse chemical tasks. To address these challenges, we propose Chem-R, a generalizable Chemical Reasoning model designed to emulate the deliberative processes of chemists. Chem-R is trained through a three-phase framework that progressively builds advanced reasoning capabilities, including: 1) Chemical Foundation Training, which establishes core chemical knowledge. 2) Chemical Reasoning Protocol Distillation, incorporating structured, expert-like reasoning traces to guide systematic and reliable problem solving. 3) Multi-task Group Relative Policy Optimization that optimizes the model for balanced performance across diverse molecular- and reaction-level tasks. This structured pipeline enables Chem-R to achieve state-of-the-art performance on comprehensive benchmarks, surpassing leading large language models, including Gemini-2.5-Pro and DeepSeek-R1, by up to 46% on molecular tasks and 66% on reaction tasks. Meanwhile, Chem-R also consistently outperforms the existing chemical foundation models across both molecular and reaction level tasks. These results highlight Chem-R's robust generalization, interpretability, and potential as a foundation for next-generation AI-driven chemical discovery.
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent
A smile is all you need: Predicting limiting activity coefficients from SMILES with natural language processing
Knowledge of mixtures' phase equilibria is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients is often limited due to high cost of experiments. For an accurate and efficient prediction of activity coefficients, machine learning approaches have been recently developed. However, current machine learning approaches still extrapolate poorly for activity coefficients of unknown molecules. In this work, we introduce the SMILES-to-Properties-Transformer (SPT), a natural language processing network to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 Million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS, UNIFAC, and improving on recent machine learning approaches.
Spherical Channels for Modeling Atomic Interactions
Modeling the energy and forces of atomic systems is a fundamental problem in computational chemistry with the potential to help address many of the world's most pressing problems, including those related to energy scarcity and climate change. These calculations are traditionally performed using Density Functional Theory, which is computationally very expensive. Machine learning has the potential to dramatically improve the efficiency of these calculations from days or hours to seconds. We propose the Spherical Channel Network (SCN) to model atomic energies and forces. The SCN is a graph neural network where nodes represent atoms and edges their neighboring atoms. The atom embeddings are a set of spherical functions, called spherical channels, represented using spherical harmonics. We demonstrate, that by rotating the embeddings based on the 3D edge orientation, more information may be utilized while maintaining the rotational equivariance of the messages. While equivariance is a desirable property, we find that by relaxing this constraint in both message passing and aggregation, improved accuracy may be achieved. We demonstrate state-of-the-art results on the large-scale Open Catalyst dataset in both energy and force prediction for numerous tasks and metrics.
T-Rex: Text-assisted Retrosynthesis Prediction
As a fundamental task in computational chemistry, retrosynthesis prediction aims to identify a set of reactants to synthesize a target molecule. Existing template-free approaches only consider the graph structures of the target molecule, which often cannot generalize well to rare reaction types and large molecules. Here, we propose T-Rex, a text-assisted retrosynthesis prediction approach that exploits pre-trained text language models, such as ChatGPT, to assist the generation of reactants. T-Rex first exploits ChatGPT to generate a description for the target molecule and rank candidate reaction centers based both the description and the molecular graph. It then re-ranks these candidates by querying the descriptions for each reactants and examines which group of reactants can best synthesize the target molecule. We observed that T-Rex substantially outperformed graph-based state-of-the-art approaches on two datasets, indicating the effectiveness of considering text information. We further found that T-Rex outperformed the variant that only use ChatGPT-based description without the re-ranking step, demonstrate how our framework outperformed a straightforward integration of ChatGPT and graph information. Collectively, we show that text generated by pre-trained language models can substantially improve retrosynthesis prediction, opening up new avenues for exploiting ChatGPT to advance computational chemistry. And the codes can be found at https://github.com/lauyikfung/T-Rex.
UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
BAPULM: Binding Affinity Prediction using Language Models
Identifying drug-target interactions is essential for developing effective therapeutics. Binding affinity quantifies these interactions, and traditional approaches rely on computationally intensive 3D structural data. In contrast, language models can efficiently process sequential data, offering an alternative approach to molecular representation. In the current study, we introduce BAPULM, an innovative sequence-based framework that leverages the chemical latent representations of proteins via ProtT5-XL-U50 and ligands through MolFormer, eliminating reliance on complex 3D configurations. Our approach was validated extensively on benchmark datasets, achieving scoring power (R) values of 0.925 pm 0.043, 0.914 pm 0.004, and 0.8132 pm 0.001 on benchmark1k2101, Test2016_290, and CSAR-HiQ_36, respectively. These findings indicate the robustness and accuracy of BAPULM across diverse datasets and underscore the potential of sequence-based models in-silico drug discovery, offering a scalable alternative to 3D-centric methods for screening potential ligands.
Large-Scale Chemical Language Representations Capture Molecular Structure and Properties
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation
High-throughput reaction condition (RC) screening is fundamental to chemical synthesis. However, current RC screening suffers from laborious and costly trial-and-error workflows. Traditional computer-aided synthesis planning (CASP) tools fail to find suitable RCs due to data sparsity and inadequate reaction representations. Nowadays, large language models (LLMs) are capable of tackling chemistry-related problems, such as molecule design, and chemical logic Q\&A tasks. However, LLMs have not yet achieved accurate predictions of chemical reaction conditions. Here, we present MM-RCR, a text-augmented multimodal LLM that learns a unified reaction representation from SMILES, reaction graphs, and textual corpus for chemical reaction recommendation (RCR). To train MM-RCR, we construct 1.2 million pair-wised Q\&A instruction datasets. Our experimental results demonstrate that MM-RCR achieves state-of-the-art performance on two open benchmark datasets and exhibits strong generalization capabilities on out-of-domain (OOD) and High-Throughput Experimentation (HTE) datasets. MM-RCR has the potential to accelerate high-throughput condition screening in chemical synthesis.
Neural Message Passing for Quantum Chemistry
Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark; these results are strong enough that we believe future work should focus on datasets with larger molecules or more accurate ground truth labels.
Reasoning-Driven Retrosynthesis Prediction with Large Language Models via Reinforcement Learning
Retrosynthesis planning, essential in organic synthesis and drug discovery, has greatly benefited from recent AI-driven advancements. Nevertheless, existing methods frequently face limitations in both applicability and explainability. Traditional graph-based and sequence-to-sequence models often lack generalized chemical knowledge, leading to predictions that are neither consistently accurate nor easily explainable. To address these challenges, we introduce RetroDFM-R, a reasoning-based large language model (LLM) designed specifically for chemical retrosynthesis. Leveraging large-scale reinforcement learning guided by chemically verifiable rewards, RetroDFM-R significantly enhances prediction accuracy and explainability. Comprehensive evaluations demonstrate that RetroDFM-R significantly outperforms state-of-the-art methods, achieving a top-1 accuracy of 65.0% on the USPTO-50K benchmark. Double-blind human assessments further validate the chemical plausibility and practical utility of RetroDFM-R's predictions. RetroDFM-R also accurately predicts multistep retrosynthetic routes reported in the literature for both real-world drug molecules and perovskite materials. Crucially, the model's explicit reasoning process provides human-interpretable insights, thereby enhancing trust and practical value in real-world retrosynthesis applications.
Small Molecule Optimization with Large Language Models
Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins
While deep learning has revolutionized the prediction of rigid protein structures, modelling the conformational ensembles of Intrinsically Disordered Proteins (IDPs) remains a key frontier. Current AI paradigms present a trade-off: Protein Language Models (PLMs) capture evolutionary statistics but lack explicit physical grounding, while generative models trained to model full ensembles are computationally expensive. In this work we critically assess these limits and propose a path forward. We introduce GeoGraph, a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue-residue contact-map topology directly from sequence. By featurizing coarse-grained molecular dynamics simulations into residue- and sequence-level graph descriptors, we create a robust and information-rich learning target. Our evaluation demonstrates that this approach yields representations that are more predictive of key biophysical properties than existing methods.
ChemBERTa-2: Towards Chemical Foundation Models
Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with these pretraining improvements, we are competitive with existing state-of-the-art architectures on the MoleculeNet benchmark suite. We analyze the degree to which improvements in pretraining translate to improvement on downstream tasks.
Are Protein Language Models Compute Optimal?
While protein language models (pLMs) have transformed biological research, the scaling laws governing their improvement remain underexplored. By adapting methodologies from NLP scaling laws, we investigated the optimal ratio between model parameters and training tokens within a fixed compute budget. Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases, and we identify a performance plateau in training loss comparable to the one found in relevant works in the field. Our findings suggest that widely-used pLMs might not be compute-optimal, indicating that larger models could achieve convergence more efficiently. Training a 35M model on a reduced token set, we attained perplexity results comparable to larger models like ESM-2 (15B) and xTrimoPGLM (100B) with a single dataset pass. This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
Foundation Models for Discovery and Exploration in Chemical Space
Accurate prediction of atomistic, thermodynamic, and kinetic properties from molecular structures underpins materials innovation. Existing computational and experimental approaches lack the scalability required to efficiently navigate chemical space. Scientific foundation models trained on large unlabeled datasets offer a path toward exploring chemical space across diverse application domains. Here we develop MIST, a family of molecular foundation models with up to an order of magnitude more parameters and data than prior works. Trained using a novel tokenization scheme that comprehensively captures nuclear, electronic, and geometric information, MIST learns from a diverse range of molecules. MIST models have been fine-tuned to predict more than 400 structure -- property relationships and match or exceed state-of-the-art performance across benchmarks spanning physiology, electrochemistry, and quantum chemistry. We demonstrate the ability of these models to solve real-world problems across chemical space, including multiobjective electrolyte solvent screening, olfactory perception mapping, isotope half-life prediction, stereochemical reasoning for chiral organometallic compounds, and binary and multi-component mixture property prediction. Probing MIST models using mechanistic interpretability methods reveals identifiable patterns and trends not explicitly present in the training data, suggesting that the models learn generalizable scientific concepts. We formulate hyperparameter-penalized Bayesian neural scaling laws and use them to reduce the computational cost of model development by an order of magnitude. The methods and findings presented here represent a significant step toward accelerating materials discovery, design, and optimization using foundation models and provide valuable guidance for training compute-optimal scientific foundation models.
Computational design of target-specific linear peptide binders with TransformerBeta
The computational prediction and design of peptide binders targeting specific linear epitopes is crucial in biological and biomedical research, yet it remains challenging due to their highly dynamic nature and the scarcity of experimentally solved binding data. To address this problem, we built an unprecedentedly large-scale library of peptide pairs within stable secondary structures (beta sheets), leveraging newly available AlphaFold predicted structures. We then developed a machine learning method based on the Transformer architecture for the design of specific linear binders, in analogy to a language translation task. Our method, TransformerBeta, accurately predicts specific beta strand interactions and samples sequences with beta sheet-like molecular properties, while capturing interpretable physico-chemical interaction patterns. As such, it can propose specific candidate binders targeting linear epitope for experimental validation to inform protein design.
MotifBench: A standardized protein design benchmark for motif-scaffolding problems
The motif-scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been made due to computational evaluation with reliable protein structure prediction and fixed-backbone sequence design methods. However, significant variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at github.com/blt2114/MotifBench. The MotifBench test cases are more difficult compared to earlier benchmarks, and include protein design problems for which solutions are known but on which, to the best of our knowledge, state-of-the-art methods fail to identify any solution.
Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language
Activity and property prediction models are the central workhorses in drug discovery and materials sciences, but currently they have to be trained or fine-tuned for new tasks. Without training or fine-tuning, scientific language models could be used for such low-data tasks through their announced zero- and few-shot capabilities. However, their predictive quality at activity prediction is lacking. In this work, we envision a novel type of activity prediction model that is able to adapt to new prediction tasks at inference time, via understanding textual information describing the task. To this end, we propose a new architecture with separate modules for chemical and natural language inputs, and a contrastive pre-training objective on data from large biochemical databases. In extensive experiments, we show that our method CLAMP yields improved predictive performance on few-shot learning benchmarks and zero-shot problems in drug discovery. We attribute the advances of our method to the modularized architecture and to our pre-training objective.
SpecTUS: Spectral Translator for Unknown Structures annotation from EI-MS spectra
Compound identification and structure annotation from mass spectra is a well-established task widely applied in drug detection, criminal forensics, small molecule biomarker discovery and chemical engineering. We propose SpecTUS: Spectral Translator for Unknown Structures, a deep neural model that addresses the task of structural annotation of small molecules from low-resolution gas chromatography electron ionization mass spectra (GC-EI-MS). Our model analyzes the spectra in de novo manner -- a direct translation from the spectra into 2D-structural representation. Our approach is particularly useful for analyzing compounds unavailable in spectral libraries. In a rigorous evaluation of our model on the novel structure annotation task across different libraries, we outperformed standard database search techniques by a wide margin. On a held-out testing set, including 28267 spectra from the NIST database, we show that our model's single suggestion perfectly reconstructs 43\% of the subset's compounds. This single suggestion is strictly better than the candidate of the database hybrid search (common method among practitioners) in 76\% of cases. In a~still affordable scenario of~10 suggestions, perfect reconstruction is achieved in 65\%, and 84\% are better than the hybrid search.
Evaluating Protein Transfer Learning with TAPE
Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
Highly Accurate Quantum Chemical Property Prediction with Uni-Mol+
Recent developments in deep learning have made remarkable progress in speeding up the prediction of quantum chemical (QC) properties by removing the need for expensive electronic structure calculations like density functional theory. However, previous methods learned from 1D SMILES sequences or 2D molecular graphs failed to achieve high accuracy as QC properties primarily depend on the 3D equilibrium conformations optimized by electronic structure methods, far different from the sequence-type and graph-type data. In this paper, we propose a novel approach called Uni-Mol+ to tackle this challenge. Uni-Mol+ first generates a raw 3D molecule conformation from inexpensive methods such as RDKit. Then, the raw conformation is iteratively updated to its target DFT equilibrium conformation using neural networks, and the learned conformation will be used to predict the QC properties. To effectively learn this update process towards the equilibrium conformation, we introduce a two-track Transformer model backbone and train it with the QC property prediction task. We also design a novel approach to guide the model's training process. Our extensive benchmarking results demonstrate that the proposed Uni-Mol+ significantly improves the accuracy of QC property prediction in various datasets. We have made the code and model publicly available at https://github.com/dptech-corp/Uni-Mol.
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
Predicting gene function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases linking DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often captured alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning of models predicting enzymatic function are likely to benefit from incorporating this multi-modal data encoding scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks that demonstrate the difficulty of this modeling objective, while demonstrating the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.
S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search
Virtual Screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention due to its efficacy in conducting extensive database screenings without relying on specific protein-binding site information. Obtaining binding affinity data for complexes is highly expensive, resulting in a limited amount of available data that covers a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise. It is challenging to identify an inductive bias that consistently maintains the integrity of molecular activity during data augmentation. To tackle these challenges, we propose S-MolSearch, the first framework to our knowledge, that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. Drawing on the principles of inverse optimal transport, S-MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S-MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S-MolSearch demonstrates superior performance on widely-used benchmarks LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual screening methods for AUROC, BEDROC and EF.
ProFSA: Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment
Pocket representations play a vital role in various biomedical applications, such as druggability estimation, ligand affinity prediction, and de novo drug design. While existing geometric features and pretrained representations have demonstrated promising results, they usually treat pockets independent of ligands, neglecting the fundamental interactions between them. However, the limited pocket-ligand complex structures available in the PDB database (less than 100 thousand non-redundant pairs) hampers large-scale pretraining endeavors for interaction modeling. To address this constraint, we propose a novel pocket pretraining approach that leverages knowledge from high-resolution atomic protein structures, assisted by highly effective pretrained small molecule representations. By segmenting protein structures into drug-like fragments and their corresponding pockets, we obtain a reasonable simulation of ligand-receptor interactions, resulting in the generation of over 5 million complexes. Subsequently, the pocket encoder is trained in a contrastive manner to align with the representation of pseudo-ligand furnished by some pretrained small molecule encoders. Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. Notably, ProFSA surpasses other pretraining methods by a substantial margin. Moreover, our work opens up a new avenue for mitigating the scarcity of protein-ligand complex data through the utilization of high-quality and diverse protein structure databases.
GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points, both in SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
From Words to Molecules: A Survey of Large Language Models in Chemistry
In recent years, Large Language Models (LLMs) have achieved significant success in natural language processing (NLP) and various interdisciplinary areas. However, applying LLMs to chemistry is a complex task that requires specialized domain knowledge. This paper provides a thorough exploration of the nuanced methodologies employed in integrating LLMs into the field of chemistry, delving into the complexities and innovations at this interdisciplinary juncture. Specifically, our analysis begins with examining how molecular information is fed into LLMs through various representation and tokenization methods. We then categorize chemical LLMs into three distinct groups based on the domain and modality of their input data, and discuss approaches for integrating these inputs for LLMs. Furthermore, this paper delves into the pretraining objectives with adaptations to chemical LLMs. After that, we explore the diverse applications of LLMs in chemistry, including novel paradigms for their application in chemistry tasks. Finally, we identify promising research directions, including further integration with chemical knowledge, advancements in continual learning, and improvements in model interpretability, paving the way for groundbreaking developments in the field.
Analyzing Learned Molecular Representations for Property Prediction
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.
Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching
We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.
AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis
Large-scale datasets have enabled highly accurate machine learning interatomic potentials (MLIPs) for general-purpose heterogeneous catalysis modeling. There are, however, some limitations in what can be treated with these potentials because of gaps in the underlying training data. To extend these capabilities, we introduce AQCat25, a complementary dataset of 13.5 million density functional theory (DFT) single point calculations designed to improve the treatment of systems where spin polarization and/or higher fidelity are critical. We also investigate methodologies for integrating new datasets, such as AQCat25, with the broader Open Catalyst 2020 (OC20) dataset to create spin-aware models without sacrificing generalizability. We find that directly tuning a general model on AQCat25 leads to catastrophic forgetting of the original dataset's knowledge. Conversely, joint training strategies prove effective for improving accuracy on the new data without sacrificing general performance. This joint approach introduces a challenge, as the model must learn from a dataset containing both mixed-fidelity calculations and mixed-physics (spin-polarized vs. unpolarized). We show that explicitly conditioning the model on this system-specific metadata, for example by using Feature-wise Linear Modulation (FiLM), successfully addresses this challenge and further enhances model accuracy. Ultimately, our work establishes an effective protocol for bridging DFT fidelity domains to advance the predictive power of foundational models in catalysis.
Benchmarking Retrieval-Augmented Generation for Chemistry
Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data is available at https://chemrag.github.io.
Endowing Protein Language Models with Structural Knowledge
Understanding the relationships between protein sequence, structure and function is a long-standing biological challenge with manifold implications from drug design to our understanding of evolution. Recently, protein language models have emerged as the preferred method for this challenge, thanks to their ability to harness large sequence databases. Yet, their reliance on expansive sequence data and parameter sets limits their flexibility and practicality in real-world scenarios. Concurrently, the recent surge in computationally predicted protein structures unlocks new opportunities in protein representation learning. While promising, the computational burden carried by such complex data still hinders widely-adopted practical applications. To address these limitations, we introduce a novel framework that enhances protein language models by integrating protein structural data. Drawing from recent advances in graph transformers, our approach refines the self-attention mechanisms of pretrained language transformers by integrating structural information with structure extractor modules. This refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database, using the same masked language modeling objective as traditional protein language models. Empirical evaluations of PST demonstrate its superior parameter efficiency relative to protein language models, despite being pretrained on a dataset comprising only 542K structures. Notably, PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction. Our findings underscore the potential of integrating structural information into protein language models, paving the way for more effective and efficient protein modeling Code and pretrained models are available at https://github.com/BorgwardtLab/PST.
ReactionT5: a large-scale pre-trained model towards application of limited reaction data
Transformer-based deep neural networks have revolutionized the field of molecular-related prediction tasks by treating molecules as symbolic sequences. These models have been successfully applied in various organic chemical applications by pretraining them with extensive compound libraries and subsequently fine-tuning them with smaller in-house datasets for specific tasks. However, many conventional methods primarily focus on single molecules, with limited exploration of pretraining for reactions involving multiple molecules. In this paper, we propose ReactionT5, a novel model that leverages pretraining on the Open Reaction Database (ORD), a publicly available large-scale resource. We further fine-tune this model for yield prediction and product prediction tasks, demonstrating its impressive performance even with limited fine-tuning data compared to traditional models. The pre-trained ReactionT5 model is publicly accessible on the Hugging Face platform.
Bootstrap Embedding on a Quantum Computer
We extend molecular bootstrap embedding to make it appropriate for implementation on a quantum computer. This enables solution of the electronic structure problem of a large molecule as an optimization problem for a composite Lagrangian governing fragments of the total system, in such a way that fragment solutions can harness the capabilities of quantum computers. By employing state-of-art quantum subroutines including the quantum SWAP test and quantum amplitude amplification, we show how a quadratic speedup can be obtained over the classical algorithm, in principle. Utilization of quantum computation also allows the algorithm to match -- at little additional computational cost -- full density matrices at fragment boundaries, instead of being limited to 1-RDMs. Current quantum computers are small, but quantum bootstrap embedding provides a potentially generalizable strategy for harnessing such small machines through quantum fragment matching.
ChemAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools created ranging from basic information retrieval to complex reaction predictions, and a dataset curation pipeline to generate the dataset ChemToolBench that facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent .
Structured Chemistry Reasoning with Large Language Models
This paper studies the problem of solving complex chemistry problems with large language models (LLMs). Despite the extensive general knowledge in LLMs (such as GPT-4), they struggle with chemistry reasoning that requires faithful grounded reasoning with diverse chemical knowledge and an integrative understanding of chemical interactions. We propose InstructChem, a new structured reasoning approach that substantially boosts the LLMs' chemical reasoning capabilities. InstructChem explicitly decomposes the reasoning into three critical phrases, including chemical formulae generation by LLMs that offers the basis for subsequent grounded reasoning, step-by-step reasoning that makes multi-step derivations with the identified formulae for a preliminary answer, and iterative review-and-refinement that steers LLMs to progressively revise the previous phases for increasing confidence, leading to the final high-confidence answer. We conduct extensive experiments on four different chemistry challenges, including quantum chemistry, quantum mechanics, physical chemistry, and chemistry kinetics. Our approach significantly enhances GPT-4 on chemistry reasoning, yielding an 8% average absolute improvement and a 30% peak improvement. We further use the generated reasoning by GPT-4 to fine-tune smaller LMs (e.g., Vicuna) and observe strong improvement of the smaller LMs. This validates our approach and enables LLMs to generate high-quality reasoning.
Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity
Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost--driven by high-order tensor product (TP) operations--restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network, that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact. Our code can be found at https://github.com/microsoft/SPHNet.
Leveraging Side Information for Ligand Conformation Generation using Diffusion-Based Approaches
Ligand molecule conformation generation is a critical challenge in drug discovery. Deep learning models have been developed to tackle this problem, particularly through the use of generative models in recent years. However, these models often generate conformations that lack meaningful structure and randomness due to the absence of essential side information. Examples of such side information include the chemical and geometric features of the target protein, ligand-target compound interactions, and ligand chemical properties. Without these constraints, the generated conformations may not be suitable for further selection and design of new drugs. To address this limitation, we propose a novel method for generating ligand conformations that leverage side information and incorporate flexible constraints into standard diffusion models. Drawing inspiration from the concept of message passing, we introduce ligand-target massage passing block, a mechanism that facilitates the exchange of information between target nodes and ligand nodes, thereby incorporating target node features. To capture non-covalent interactions, we introduce ligand-target compound inter and intra edges. To further improve the biological relevance of the generated conformations, we train energy models using scalar chemical features. These models guide the progress of the standard Denoising Diffusion Probabilistic Models, resulting in more biologically meaningful conformations. We evaluate the performance of SIDEGEN using the PDBBind-2020 dataset, comparing it against other methods. The results demonstrate improvements in both Aligned RMSD and Ligand RMSD evaluations. Specifically, our model outperforms GeoDiff (trained on PDBBind-2020) by 20% in terms of the median aligned RMSD metric.
Bridging Quantum Mechanics to Organic Liquid Properties via a Universal Force Field
Molecular dynamics (MD) simulations are essential tools for unraveling atomistic insights into the structure and dynamics of condensed-phase systems. However, the universal and accurate prediction of macroscopic properties from ab initio calculations remains a significant challenge, often hindered by the trade-off between computational cost and simulation accuracy. Here, we present ByteFF-Pol, a graph neural network (GNN)-parameterized polarizable force field, trained exclusively on high-level quantum mechanics (QM) data. Leveraging physically-motivated force field forms and training strategies, ByteFF-Pol exhibits exceptional performance in predicting thermodynamic and transport properties for a wide range of small-molecule liquids and electrolytes, outperforming state-of-the-art (SOTA) classical and machine learning force fields. The zero-shot prediction capability of ByteFF-Pol bridges the gap between microscopic QM calculations and macroscopic liquid properties, enabling the exploration of previously intractable chemical spaces. This advancement holds transformative potential for applications such as electrolyte design and custom-tailored solvent, representing a pivotal step toward data-driven materials discovery.
SimPoly: Simulation of Polymers with Machine Learning Force Fields Derived from First Principles
Polymers are a versatile class of materials with widespread industrial applications. Advanced computational tools could revolutionize their design, but their complex, multi-scale nature poses significant modeling challenges. Conventional force fields often lack the accuracy and transferability required to capture the intricate interactions governing polymer behavior. Conversely, quantum-chemical methods are computationally prohibitive for the large systems and long timescales required to simulate relevant polymer phenomena. Here, we overcome these limitations with a machine learning force field (MLFF) approach. We demonstrate that macroscopic properties for a broad range of polymers can be predicted ab initio, without fitting to experimental data. Specifically, we develop a fast and scalable MLFF to accurately predict polymer densities, outperforming established classical force fields. Our MLFF also captures second-order phase transitions, enabling the prediction of glass transition temperatures. To accelerate progress in this domain, we introduce a benchmark of experimental bulk properties for 130 polymers and an accompanying quantum-chemical dataset. This work lays the foundation for a fully in silico design pipeline for next-generation polymeric materials.
A mesh-free hybrid Chebyshev-Tucker tensor format with applications to multi-particle modelling
In this paper, we introduce a mesh-free two-level hybrid Tucker tensor format for approximation of multivariate functions, which combines the product Chebyshev interpolation with the ALS-based Tucker decomposition of the tensor of Chebyshev coefficients. It allows to avoid the expenses of the rank-structured approximation of function-related tensors defined on large spacial grids, while benefiting from the Tucker decomposition of the rather small core tensor of Chebyshev coefficients. This leads to nearly optimal Tucker rank parameters which are close to the results for well established Tucker-ALS algorithm applied to the large grid-based tensors. These rank parameters inherited from the Tucker-ALS decomposition of the coefficient tensor can be much less than the polynomial degrees of the initial Chebyshev interpolant via function independent basis set. Furthermore, the tensor product Chebyshev polynomials discretized on a tensor grid leads to a low-rank two-level orthogonal algebraic Tucker tensor that approximates the initial function with controllable accuracy. It is shown that our techniques could be gainfully applied to the long-range part of the electrostatic potential of multi-particle systems approximated in the range-separated tensor format. Error and complexity estimates of the proposed methods are presented. We demonstrate the efficiency of the suggested method numerically on examples of the long-range components of multi-particle interaction potentials generated by 3D Newton kernel for large bio-molecule systems and lattice-type compounds.
Towards Data-Efficient Pretraining for Atomic Property Prediction
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fr\'echet Inception Distance, for molecular graphs which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
Solvent: A Framework for Protein Folding
Consistency and reliability are crucial for conducting AI research. Many famous research fields, such as object detection, have been compared and validated with solid benchmark frameworks. After AlphaFold2, the protein folding task has entered a new phase, and many methods are proposed based on the component of AlphaFold2. The importance of a unified research framework in protein folding contains implementations and benchmarks to consistently and fairly compare various approaches. To achieve this, we present Solvent, an protein folding framework that supports significant components of state-of-th-arts models in the manner of off-the-shelf interface Solvent contains different models implemented in a unified codebase and supports training and evaluation for defined models on the same dataset. We benchmark well-known algorithms and their components and provide experiments that give helpful insights into the protein structure modeling field. We hope that Solvent will increase the reliability and consistency of proposed models and gives efficiency in both speed and costs, resulting in acceleration on protein folding modeling research. The code is available at https://github.com/kakaobrain/solvent, and the project will continue to be developed.
A Foundational Potential Energy Surface Dataset for Materials
Accurate potential energy surface (PES) descriptions are essential for atomistic simulations of materials. Universal machine learning interatomic potentials (UMLIPs)^{1-3} offer a computationally efficient alternative to density functional theory (DFT)^4 for PES modeling across the periodic table. However, their accuracy today is fundamentally constrained due to a reliance on DFT relaxation data.^{5,6} Here, we introduce MatPES, a foundational PES dataset comprising sim 400,000 structures carefully sampled from 281 million molecular dynamics snapshots that span 16 billion atomic environments. We demonstrate that UMLIPs trained on the modestly sized MatPES dataset can rival, or even outperform, prior models trained on much larger datasets across a broad range of equilibrium, near-equilibrium, and molecular dynamics property benchmarks. We also introduce the first high-fidelity PES dataset based on the revised regularized strongly constrained and appropriately normed (r^2SCAN) functional^7 with greatly improved descriptions of interatomic bonding. The open source MatPES initiative emphasizes the importance of data quality over quantity in materials science and enables broad community-driven advancements toward more reliable, generalizable, and efficient UMLIPs for large-scale materials discovery and design.
MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation
Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information-two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design. Code are available at https://github.com/Eurekashen/MolSpectLLM{https://github.com/Eurekashen/MolSpectLLM}.
A Large Encoder-Decoder Family of Foundation Models For Chemical Language
Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening chemical language representation understanding. This paper introduces a large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, which is equivalent to 4 billion of molecular tokens. The proposed foundation model supports different complex tasks, including quantum property prediction, and offer flexibility with two main variants (289M and 8times289M). Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for the reasoning tasks. We demonstrate that the produced latent space is separable compared to the state-of-the-art with few-shot learning capabilities.
Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS^2-based Proteomics
Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS^2) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS^2 analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assumes that a fragment's probability is uniform across all peptides. Nevertheless, this assumption is oversimplified from a biochemical principle point of view and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS^2 spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
Transformers for molecular property prediction: Domain adaptation efficiently improves performance
Most of the current transformer-based chemical language models are pre-trained on millions to billions of molecules. However, the improvement from such scaling in dataset size is not confidently linked to improved molecular property prediction. The aim of this study is to investigate and overcome some of the limitations of transformer models in predicting molecular properties. Specifically, we examine the impact of pre-training dataset size and diversity on the performance of transformer models and investigate the use of domain adaptation as a technique for improving model performance. First, our findings indicate that increasing pretraining dataset size beyond 400K molecules from the GuacaMol dataset does not result in a significant improvement on four ADME endpoints, namely, solubility, permeability, microsomal stability, and plasma protein binding. Second, our results demonstrate that using domain adaptation by further training the transformer model on a small set of domain-relevant molecules, i.e., a few hundred to a few thousand, using multi-task regression of physicochemical properties was sufficient to significantly improve performance for three out of the four investigated ADME endpoints (P-value < 0.001). Finally, we observe that a model pre-trained on 400K molecules and domain adopted on a few hundred/thousand molecules performs similarly (P-value > 0.05) to more complicated transformer models like MolBERT(pre-trained on 1.3M molecules) and MolFormer (pre-trained on 100M molecules). A comparison to a random forest model trained on basic physicochemical properties showed similar performance to the examined transformer models. We believe that current transformer models can be improved through further systematic analysis of pre-training and downstream data, pre-training objectives, and scaling laws, ultimately leading to better and more helpful models.
System-specific parameter optimization for non-polarizable and polarizable force fields
The accuracy of classical force fields (FFs) has been shown to be limited for the simulation of cation-protein systems despite their importance in understanding the processes of life. Improvements can result from optimizing the parameters of classical FFs or by extending the FF formulation by terms describing charge transfer and polarization effects. In this work, we introduce our implementation of the CTPOL model in OpenMM, which extends the classical additive FF formula by adding charge transfer (CT) and polarization (POL). Furthermore, we present an open-source parameterization tool, called FFAFFURR that enables the (system specific) parameterization of OPLS-AA and CTPOL models. The performance of our workflow was evaluated by its ability to reproduce quantum chemistry energies and by molecular dynamics simulations of a Zinc finger protein.
Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs
Graph neural networks are emerging as promising methods for modeling molecular graphs, in which nodes and edges correspond to atoms and chemical bonds, respectively. Recent studies show that when 3D molecular geometries, such as bond lengths and angles, are available, molecular property prediction tasks can be made more accurate. However, computing of 3D molecular geometries requires quantum calculations that are computationally prohibitive. For example, accurate calculation of 3D geometries of a small molecule requires hours of computing time using density functional theory (DFT). Here, we propose to predict the ground-state 3D geometries from molecular graphs using machine learning methods. To make this feasible, we develop a benchmark, known as Molecule3D, that includes a dataset with precise ground-state geometries of approximately 4 million molecules derived from DFT. We also provide a set of software tools for data processing, splitting, training, and evaluation, etc. Specifically, we propose to assess the error and validity of predicted geometries using four metrics. We implement two baseline methods that either predict the pairwise distance between atoms or atom coordinates in 3D space. Experimental results show that, compared with generating 3D geometries with RDKit, our method can achieve comparable prediction accuracy but with much smaller computational costs. Our Molecule3D is available as a module of the MoleculeX software library (https://github.com/divelab/MoleculeX).
CACTUS: Chemistry Agent Connecting Tool-Usage to Science
Large language models (LLMs) have shown remarkable potential in various domains, but they often lack the ability to access and reason over domain-specific knowledge and tools. In this paper, we introduced CACTUS (Chemistry Agent Connecting Tool-Usage to Science), an LLM-based agent that integrates cheminformatics tools to enable advanced reasoning and problem-solving in chemistry and molecular discovery. We evaluate the performance of CACTUS using a diverse set of open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama2-7b, and Mistral-7b, on a benchmark of thousands of chemistry questions. Our results demonstrate that CACTUS significantly outperforms baseline LLMs, with the Gemma-7b and Mistral-7b models achieving the highest accuracy regardless of the prompting strategy used. Moreover, we explore the impact of domain-specific prompting and hardware configurations on model performance, highlighting the importance of prompt engineering and the potential for deploying smaller models on consumer-grade hardware without significant loss in accuracy. By combining the cognitive capabilities of open-source LLMs with domain-specific tools, CACTUS can assist researchers in tasks such as molecular property prediction, similarity searching, and drug-likeness assessment. Furthermore, CACTUS represents a significant milestone in the field of cheminformatics, offering an adaptable tool for researchers engaged in chemistry and molecular discovery. By integrating the strengths of open-source LLMs with domain-specific tools, CACTUS has the potential to accelerate scientific advancement and unlock new frontiers in the exploration of novel, effective, and safe therapeutic candidates, catalysts, and materials. Moreover, CACTUS's ability to integrate with automated experimentation platforms and make data-driven decisions in real time opens up new possibilities for autonomous discovery.
MolErr2Fix:Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Revision
Large Language Models (LLMs) have shown growing potential in molecular sciences, but they often produce chemically inaccurate descriptions and struggle to recognize or justify potential errors. This raises important concerns about their robustness and reliability in scientific applications. To support more rigorous evaluation of LLMs in chemical reasoning, we present the MolErr2Fix benchmark, designed to assess LLMs on error detection and correction in molecular descriptions. Unlike existing benchmarks focused on molecule-to-text generation or property prediction, MolErr2Fix emphasizes fine-grained chemical understanding. It tasks LLMs with identifying, localizing, explaining, and revising potential structural and semantic errors in molecular descriptions. Specifically, MolErr2Fix consists of 1,193 fine-grained annotated error instances. Each instance contains quadruple annotations, i.e,. (error type, span location, the explanation, and the correction). These tasks are intended to reflect the types of reasoning and verification required in real-world chemical communication. Evaluations of current state-of-the-art LLMs reveal notable performance gaps, underscoring the need for more robust chemical reasoning capabilities. MolErr2Fix provides a focused benchmark for evaluating such capabilities and aims to support progress toward more reliable and chemically informed language models. All annotations and an accompanying evaluation API will be publicly released to facilitate future research.
Agentic End-to-End De Novo Protein Design for Tailored Dynamics Using a Language Diffusion Model
Proteins are dynamic molecular machines whose biological functions, spanning enzymatic catalysis, signal transduction, and structural adaptation, are intrinsically linked to their motions. Designing proteins with targeted dynamic properties, however, remains a challenge due to the complex, degenerate relationships between sequence, structure, and molecular motion. Here, we introduce VibeGen, a generative AI framework that enables end-to-end de novo protein design conditioned on normal mode vibrations. VibeGen employs an agentic dual-model architecture, comprising a protein designer that generates sequence candidates based on specified vibrational modes and a protein predictor that evaluates their dynamic accuracy. This approach synergizes diversity, accuracy, and novelty during the design process. Via full-atom molecular simulations as direct validation, we demonstrate that the designed proteins accurately reproduce the prescribed normal mode amplitudes across the backbone while adopting various stable, functionally relevant structures. Notably, generated sequences are de novo, exhibiting no significant similarity to natural proteins, thereby expanding the accessible protein space beyond evolutionary constraints. Our work integrates protein dynamics into generative protein design, and establishes a direct, bidirectional link between sequence and vibrational behavior, unlocking new pathways for engineering biomolecules with tailored dynamical and functional properties. This framework holds broad implications for the rational design of flexible enzymes, dynamic scaffolds, and biomaterials, paving the way toward dynamics-informed AI-driven protein engineering.
Adapting Quantum Machine Learning for Energy Dissociation of Bonds
Accurate prediction of bond dissociation energies (BDEs) underpins mechanistic insight and the rational design of molecules and materials. We present a systematic, reproducible benchmark comparing quantum and classical machine learning models for BDE prediction using a chemically curated feature set encompassing atomic properties (atomic numbers, hybridization), bond characteristics (bond order, type), and local environmental descriptors. Our quantum framework, implemented in Qiskit Aer on six qubits, employs ZZFeatureMap encodings with variational ansatz (RealAmplitudes) across multiple architectures Variational Quantum Regressors (VQR), Quantum Support Vector Regressors (QSVR), Quantum Neural Networks (QNN), Quantum Convolutional Neural Networks (QCNN), and Quantum Random Forests (QRF). These are rigorously benchmarked against strong classical baselines, including Support Vector Regression (SVR), Random Forests (RF), and Multi-Layer Perceptrons (MLP). Comprehensive evaluation spanning absolute and relative error metrics, threshold accuracies, and error distributions shows that top-performing quantum models (QCNN, QRF) match the predictive accuracy and robustness of classical ensembles and deep networks, particularly within the chemically prevalent mid-range BDE regime. These findings establish a transparent baseline for quantum-enhanced molecular property prediction and outline a practical foundation for advancing quantum computational chemistry toward near chemical accuracy.
