Journal of Cheminformatics

Journal name: Journal of Cheminformatics

ISSN: 1758-2946

Official website: http://jcheminf.springeropen.com/

Publisher: Springer Nature

Publication frequency:

Impact factor: 8.489

First published: 2009

Articles per year: 66

Open access: Yes

Merging bioactivity predictions from cell morphology and chemical fingerprint models using similarity to training data

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-06-02 , DOI: 10.1186/s13321-023-00723-x

Srijit Seal, Hongbin Yang, Maria-Anna Trapotsi, Satvik Singh, Jordi Carreras-Puigvert, Ola Spjuth, Andreas Bender

The applicability domain of machine learning models trained on structural fingerprints for the prediction of biological endpoints is often limited by the lack of diversity of chemical space of the training data. In this work, we developed similarity-based merger models which combined the outputs of individual models trained on cell morphology (based on Cell Painting) and chemical structure (based on chemical fingerprints) and the structural and morphological similarities of the compounds in the test dataset to compounds in the training dataset. We applied these similarity-based merger models using logistic regression models on the predictions and similarities as features and predicted assay hit calls of 177 assays from ChEMBL, PubChem and the Broad Institute (where the required Cell Painting annotations were available). We found that the similarity-based merger models outperformed other models, with an additional 20% of assays (79 out of 177 assays) reaching an AUC > 0.70, compared with 65 out of 177 assays using structural models and 50 out of 177 assays using Cell Painting models. Our results demonstrated that similarity-based merger models combining structure and cell morphology models can more accurately predict a wide range of biological assay outcomes and further expanded the applicability domain by better extrapolating to new structural and morphology spaces.
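The merger idea described above (a logistic regression whose features are the two base-model predictions plus the two similarities to the training data) can be sketched in a few lines. This is a minimal illustration in pure Python, not the authors' implementation: the feature order, the toy values, and the helper names `fit_logistic` and `predict_proba` are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample under the current parameters
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Probability of a hit call for one feature row."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical rows: [p_structure, p_morphology, sim_structure, sim_morphology]
X_train = [
    [0.9, 0.8, 0.7, 0.6],
    [0.8, 0.3, 0.9, 0.2],
    [0.2, 0.9, 0.1, 0.8],
    [0.1, 0.2, 0.5, 0.5],
    [0.3, 0.1, 0.8, 0.3],
    [0.7, 0.7, 0.4, 0.7],
]
y_train = [1, 1, 1, 0, 0, 1]  # toy assay hit calls

merger = fit_logistic(X_train, y_train)
print(predict_proba(merger, [0.9, 0.9, 0.8, 0.8]))
```

In practice one would more likely use an established library such as scikit-learn for the regression; the hand-written fit is used here only to keep the sketch self-contained.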

UmetaFlow: an untargeted metabolomics workflow for high-throughput data processing and analysis

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-05-12 , DOI: 10.1186/s13321-023-00724-w

Eftychia E Kontou, Axel Walter, Oliver Alka, Julianus Pfeuffer, Timo Sachsenberg, Omkar S Mohite, Matin Nuhamunada, Oliver Kohlbacher, Tilmann Weber

Metabolomics experiments generate highly complex datasets, which are time- and work-intensive, and sometimes even error-prone, to inspect manually. Therefore, new methods for automated, fast, reproducible, and accurate data processing and dereplication are required. Here, we present UmetaFlow, a computational workflow for untargeted metabolomics that combines algorithms for data pre-processing, spectral matching, molecular formula and structural predictions, and an integration to the GNPS workflows Feature-Based Molecular Networking and Ion Identity Molecular Networking for downstream analysis. UmetaFlow is implemented as a Snakemake workflow, making it easy to use, scalable, and reproducible. For more interactive computing, visualization, as well as development, the workflow is also implemented in Jupyter notebooks using the Python programming language and a set of Python bindings to the OpenMS algorithms (pyOpenMS). Finally, UmetaFlow is also offered as a web-based Graphical User Interface for parameter optimization and processing of smaller-sized datasets. UmetaFlow was validated with in-house LC–MS/MS datasets of actinomycetes producing known secondary metabolites, as well as commercial standards, and it detected all expected features and accurately annotated 76% of the molecular formulas and 65% of the structures. As a more generic validation, the publicly available MTBLS733 and MTBLS736 datasets were used for benchmarking, and UmetaFlow detected more than 90% of all ground truth features and performed exceptionally well in quantification and discriminating marker selection. We anticipate that UmetaFlow will provide a useful platform for the interpretation of large metabolomics datasets.

qHTSWaterfall: 3-dimensional visualization software for quantitative high-throughput screening (qHTS) data

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-03-31 , DOI: 10.1186/s13321-023-00717-9

Bryan Queme, John C Braisted, Patricia Dranchak, James Inglese

High throughput screening (HTS) is widely used in drug discovery and chemical biology to identify and characterize agents having pharmacologic properties often by evaluation of large chemical libraries. Standard HTS data can be simply plotted as an x–y graph usually represented as % activity of a compound tested at a single concentration vs compound ID, whereas quantitative HTS (qHTS) data incorporates a third axis represented by concentration. By virtue of the additional data points arising from the compound titration and the incorporation of logistic fit parameters that define the concentration–response curve, such as EC50 and Hill slope, qHTS data has been challenging to display on a single graph. Here we provide a flexible solution to the rapid plotting of complete qHTS data sets to produce a 3-axis plot we call qHTS Waterfall Plots. The software described here can be generally applied to any 3-axis dataset and is available as both an R package and an R shiny application.
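To make the three axes concrete, the sketch below evaluates a four-parameter logistic (Hill) concentration–response curve from an EC50 and Hill slope and emits (compound, log concentration, response) triples that a 3-axis plot could consume. The published software is an R package; this Python fragment is only an illustration, and the function names, compound IDs, and parameter values are hypothetical.

```python
import math

def hill_response(conc, ec50, hill_slope, bottom=0.0, top=100.0):
    """Four-parameter logistic (Hill) model: response at concentration `conc`."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill_slope)

def waterfall_points(compounds, concentrations):
    """Build (compound_id, log10(concentration), response) triples for a 3-axis plot."""
    points = []
    for compound_id, params in compounds.items():
        for conc in concentrations:
            points.append((compound_id, math.log10(conc), hill_response(conc, **params)))
    return points

# Hypothetical fit parameters for two compounds (concentrations in mol/L)
compounds = {
    "cmpd-1": {"ec50": 1e-6, "hill_slope": 1.0},
    "cmpd-2": {"ec50": 1e-7, "hill_slope": 2.0},
}
concentrations = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
points = waterfall_points(compounds, concentrations)
```

With `bottom=0` and `top=100`, the response at a concentration equal to the EC50 is 50% by construction, which is a quick sanity check on any fitted curve.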

Deep generative model for drug design from protein target sequence

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-03-28 , DOI: 10.1186/s13321-023-00702-2

Yangyang Chen, Zixu Wang, Lei Wang, Jianmin Wang, Pengyong Li, Dongsheng Cao, Xiangxiang Zeng, Xiucai Ye, Tetsuya Sakurai

Drug discovery for a protein target is a laborious and costly process. Deep learning (DL) methods have been applied to drug discovery and successfully generated novel molecular structures, and they can substantially reduce development time and costs. However, most of them rely on prior knowledge, either by drawing on the structure and properties of known molecules to generate similar candidate molecules or extracting information on the binding sites of protein pockets to obtain molecules that can bind to them. In this paper, DeepTarget, an end-to-end DL model, was proposed to generate novel molecules solely relying on the amino acid sequence of the target protein to reduce the heavy reliance on prior knowledge. DeepTarget includes three modules: Amino Acid Sequence Embedding (AASE), Structural Feature Inference (SFI), and Molecule Generation (MG). AASE generates embeddings from the amino acid sequence of the target protein. SFI infers the potential structural features of the synthesized molecule, and MG seeks to construct the eventual molecule. The validity of the generated molecules was demonstrated by a benchmark platform of molecular generation models. The interaction between the generated molecules and the target proteins was also verified on the basis of two metrics, drug–target affinity and molecular docking. The results of the experiments indicated the efficacy of the model for direct molecule generation solely conditioned on amino acid sequence.

Chemical rules for optimization of chemical mutagenicity via matched molecular pairs analysis and machine learning methods

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-03-20 , DOI: 10.1186/s13321-023-00707-x

Chaofeng Lou, Hongbin Yang, Hua Deng, Mengting Huang, Weihua Li, Guixia Liu, Philip W Lee, Yun Tang

Chemical mutagenicity is a serious issue that needs to be addressed in early drug discovery. Over a long period of time, medicinal chemists have manually summarized a series of empirical rules for the optimization of chemical mutagenicity. However, given the rising amount of data, it is getting more difficult for medicinal chemists to identify more comprehensive chemical rules behind the biochemical data. Herein, we integrated a large Ames mutagenicity data set with 8576 compounds to derive mutagenicity transformation rules for reversing Ames mutagenicity via matched molecular pairs analysis. A well-trained consensus model with a reasonable applicability domain was constructed, which showed favorable performance in the external validation set with an accuracy of 0.815. The model was used to assess the generalizability and validity of these mutagenicity transformation rules. The results demonstrated that these rules were of great value and could provide inspiration for the structural modifications of compounds with potential mutagenic effects. We also found that the local chemical environment of the attachment points of rules was critical for successful transformation. To facilitate the use of these mutagenicity transformation rules, we integrated them into ADMETopt2 ( http://lmmd.ecust.edu.cn/admetsar2/admetopt2/ ), a free web server for optimization of chemical ADMET properties. The above-mentioned approach would be extended to the optimization of other toxicity endpoints.

ABT-MPNN: an atom-bond transformer-based message-passing neural network for molecular property prediction

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-26 , DOI: 10.1186/s13321-023-00698-9

Chengyou Liu, Yan Sun, Rebecca Davis, Silvia T Cardona, Pingzhao Hu

Graph convolutional neural networks (GCNs) have been repeatedly shown to have robust capacities for modeling graph data such as small molecules. Message-passing neural networks (MPNNs), a group of GCN variants that can learn and aggregate local information of molecules through iterative message passing, have exhibited advancements in molecular modeling and property prediction. Moreover, given the merits of Transformers in multiple artificial intelligence domains, it is desirable to combine the self-attention mechanism with MPNNs for better molecular representation. We propose an atom-bond transformer-based message-passing neural network (ABT-MPNN), to improve the molecular representation embedding process for molecular property predictions. By designing corresponding attention mechanisms in the message-passing and readout phases of the MPNN, our method provides a novel architecture that integrates molecular representations at the bond, atom and molecule levels in an end-to-end way. The experimental results across nine datasets show that the proposed ABT-MPNN outperforms or is comparable to the state-of-the-art baseline models in quantitative structure–property relationship tasks. We provide case examples of Mycobacterium tuberculosis growth inhibitors and demonstrate that our model's visualization modality of attention at the atomic level could be an insightful way to investigate molecular atoms or functional groups associated with desired biological properties. The new model provides an innovative way to investigate the effect of self-attention on chemical substructures and functional groups in molecular representation learning, which increases the interpretability of the traditional MPNN and can serve as a valuable way to investigate the mechanism of action of drugs.

Predicting RP-LC retention indices of structurally unknown chemicals from mass spectrometry data

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-24 , DOI: 10.1186/s13321-023-00699-8

Jim Boelrijk, Denice van Herwerden, Bernd Ensing, Patrick Forré, Saer Samanipour

Non-target analysis combined with liquid chromatography high resolution mass spectrometry is considered one of the most comprehensive strategies for the detection and identification of known and unknown chemicals in complex samples. However, many compounds remain unidentified due to data complexity and the limited number of structures in chemical databases. In this work, we have developed and validated a novel machine learning algorithm to predict the retention index ($$r_i$$) values for structurally (un)known chemicals based on their measured fragmentation pattern. The developed model, for the first time, enabled the prediction of $$r_i$$ values without the need for the exact structure of the chemicals, with an $$R^2$$ of 0.91 and 0.77 and root mean squared error (RMSE) of 47 and 67 $$r_i$$ units for the NORMAN ($$n=3131$$) and amide ($$n=604$$) test sets, respectively. This fragment-based model showed comparable accuracy in $$r_i$$ prediction compared to conventional descriptor-based models that rely on known chemical structure, which obtained an $$R^2$$ of 0.85 with an RMSE of 67.
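For readers unfamiliar with the two reported metrics, the sketch below computes RMSE and the coefficient of determination on a handful of made-up retention index values. It assumes the common definition of R² as one minus the ratio of residual to total sum of squares; the abstract does not state which definition the authors used, and the numbers here are purely illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up retention indices (measured vs. predicted), for illustration only
measured = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]

print(rmse(measured, predicted))       # 10.0
print(r_squared(measured, predicted))  # approximately 0.992
```

Because RMSE is expressed in the units of the target, an RMSE of 47 retention index units is only meaningful relative to the spread of the index values in the test set, which is why it is usually reported alongside R².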

Reconstruction of lossless molecular representations from fingerprints

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-23 , DOI: 10.1186/s13321-023-00693-0

Umit V Ucak, Islambek Ashyrmamatov, Juyong Lee

The simplified molecular-input line-entry system (SMILES) is the most prevalent molecular representation used in AI-based chemical applications. However, there are innate limitations associated with the internal structure of SMILES representations. In this context, this study exploits the resolution and robustness of unique molecular representations, i.e., SMILES and SELFIES (SELF-referencIng Embedded strings), reconstructed from a set of structural fingerprints, which are proposed and used herein as vital representational tools for chemical and natural language processing (NLP) applications. This is achieved by restoring the connectivity information lost during fingerprint transformation with high accuracy. Notably, the results reveal that seemingly irreversible molecule-to-fingerprint conversion is feasible. More specifically, four structural fingerprints, extended connectivity, topological torsion, atom pairs, and atomic environments can be used as inputs and outputs of chemical NLP applications. Therefore, this comprehensive study addresses the major limitation of structural fingerprints that precludes their use in NLP models. Our findings will facilitate the development of text- or fingerprint-based chemoinformatic models for generative and translational tasks.

Exploring QSAR models for activity-cliff prediction

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-04-17 , DOI: 10.1186/s13321-023-00708-w

Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M Morris

Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that QSAR models struggle to predict ACs and that ACs thus form a major source of prediction error. However, the AC-prediction power of modern QSAR methods and its quantitative relationship to general QSAR-prediction performance is still underexplored. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. Our results provide strong support for the hypothesis that indeed QSAR models frequently fail to predict ACs. We observe low AC-sensitivity amongst the evaluated models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance amongst the tested input representations. A potential future pathway to improve QSAR-modelling performance might be the development of techniques to increase AC-sensitivity.
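The pair-classification task can be illustrated with a small sketch: a pair counts as an activity cliff when the two compounds are structurally similar yet differ strongly in activity. The example below uses Tanimoto similarity on sets of fingerprint on-bits as a stand-in for "small structural modification"; that choice, the cutoffs, and the activity values are assumptions for illustration and are not taken from the paper.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as collections of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def is_activity_cliff(fp_a, fp_b, act_a, act_b, sim_cutoff=0.8, act_cutoff=2.0):
    """A pair is a cliff if it is similar enough and the activity gap is large enough."""
    return tanimoto(fp_a, fp_b) >= sim_cutoff and abs(act_a - act_b) >= act_cutoff

# Hypothetical fingerprints differing by a single bit
fp_1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
fp_2 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

# Measured activities (e.g. pKi) show a cliff ...
true_cliff = is_activity_cliff(fp_1, fp_2, 8.5, 5.0)
# ... but a model that predicts both activities may smooth it away
predicted_cliff = is_activity_cliff(fp_1, fp_2, 7.0, 6.5)
# Supplying the measured activity of one compound can recover the call
semi_known_cliff = is_activity_cliff(fp_1, fp_2, 8.5, 6.5)
```

The third call mirrors the setting in which the actual activity of one compound is given: the classifier then only needs the model's prediction for the other compound.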

LinChemIn: SynGraph—a data model and a toolkit to analyze and compare synthetic routes

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-04-01 , DOI: 10.1186/s13321-023-00714-y

Marta Pasquini, Marco Stenta

The increasing amount of chemical reaction data makes traditional ways to navigate its corpus less effective, while the demand for novel approaches and instruments is rising. Recent data science and machine learning techniques support the development of new ways to extract value from the available reaction data. On the one side, Computer-Aided Synthesis Planning tools can predict synthetic routes in a model-driven approach; on the other side, experimental routes can be extracted from the Network of Organic Chemistry, in which reaction data are linked in a network. In this context, the need to combine, compare and analyze synthetic routes generated by different sources arises naturally. Here we present LinChemIn, a Python toolkit that allows chemoinformatics operations on synthetic routes and reaction networks. Wrapping some third-party packages for handling graph arithmetic and chemoinformatics and implementing new data models and functionalities, LinChemIn allows the interconversion between data formats and data models and enables route-level analysis and operations, including route comparison and descriptors calculation. Object-Oriented Design principles inspire the software architecture, and the modules are structured to maximize code reusability and support code testing and refactoring. The code structure should facilitate external contributions, thus encouraging open and collaborative software development. The current version of LinChemIn allows users to combine synthetic routes generated from various tools and analyze them, and constitutes an open and extensible framework capable of incorporating contributions from the community and fostering scientific discussion. Our roadmap envisages the development of sophisticated metrics for routes evaluation, a multi-parameter scoring system, and the implementation of an entire “ecosystem” of functionalities operating on synthetic routes. LinChemIn is freely available at http://github.com/syngenta/linchemin.

Principles and requirements for nanomaterial representations to facilitate machine processing and cooperation with nanoinformatics tools

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-04-12 , DOI: 10.1186/s13321-022-00669-6

Kostas Blekos, Kostas Chairetakis, Iseult Lynch, Effie Marcoulaki

Efficient and machine-readable representations are needed to accurately identify, validate and communicate information about chemical structures. Many such representations have been developed (as, for example, the Simplified Molecular-Input Line-Entry System and the IUPAC International Chemical Identifier), each offering advantages specific to various use-cases. Representation of the multi-component structures of nanomaterials (NMs), though, remains out of scope for all the currently available standards, as the nature of NMs sets new challenges on formalizing the encoding of their structure, interactions and environmental parameters. In this work we identify a set of principles that a NM representation should adhere to in order to provide “machine-friendly” encodings of NMs, i.e. encodings that facilitate machine processing and cooperation with nanoinformatics tools. We illustrate our principles by showing how the recently introduced InChI-based NM representation might be augmented, in principle, to also encode morphology and mixture properties, distributions of properties, and also to capture auxiliary information and allow data reuse.

DFFNDDS: prediction of synergistic drug combinations with dual feature fusion networks

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-03-16 , DOI: 10.1186/s13321-023-00690-3

Mengdie Xu, Xinwei Zhao, Jingyu Wang, Wei Feng, Naifeng Wen, Chunyu Wang, Junjie Wang, Yun Liu, Lingling Zhao

Drug combination therapies are promising clinical treatments for curing patients. However, efficiently identifying valid drug combinations remains challenging because the number of available drugs has increased rapidly. In this study, we proposed a deep learning model called the Dual Feature Fusion Network for Drug–Drug Synergy prediction (DFFNDDS) that utilizes a fine-tuned pretrained language model and dual feature fusion mechanism to predict synergistic drug combinations. The dual feature fusion mechanism fuses the drug features and cell line features at the bit-wise level and the vector-wise level. We demonstrated that DFFNDDS outperforms competitive methods and can serve as a reliable tool for identifying synergistic drug combinations.

How to approach machine learning-based prediction of drug/compound–target interactions

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-06 , DOI: 10.1186/s13321-023-00689-w

Heval Atas Guvenilir, Tunca Doğan

The identification of drug/compound–target interactions (DTIs) constitutes the basis of drug discovery, for which computational predictive approaches have been developed. As a relatively new data-driven paradigm, proteochemometric (PCM) modeling utilizes both protein and compound properties as a pair at the input level and processes them via statistical/machine learning. The representation of input samples (i.e., proteins and their ligands) in the form of quantitative feature vectors is crucial for the extraction of interaction-related properties during the artificial learning and subsequent prediction of DTIs. Lately, the representation learning approach, in which input samples are automatically featurized via training and applying a machine/deep learning model, has been utilized in biomedical sciences. In this study, we performed a comprehensive investigation of different computational approaches/techniques for protein featurization (including both conventional approaches and the novel learned embeddings), data preparation and exploration, machine learning-based modeling, and performance evaluation with the aim of achieving better data representations and more successful learning in DTI prediction. For this, we first constructed realistic and challenging benchmark datasets on small, medium, and large scales to be used as reliable gold standards for specific DTI modeling tasks. We developed and applied a network analysis-based splitting strategy to divide datasets into structurally different training and test folds. Using these datasets together with various featurization methods, we trained and tested DTI prediction models and evaluated their performance from different angles. 
Our main findings can be summarized under three items: (i) random splitting of datasets into train and test folds leads to near-complete data memorization and produces highly over-optimistic results and, as a result, should be avoided; (ii) learned protein sequence embeddings work well in DTI prediction and offer high potential, even though interaction-related properties (e.g., structures) of proteins are unused during their self-supervised model training; and (iii) during the learning process, PCM models tend to rely heavily on compound features while partially ignoring protein features, primarily due to the inherent bias in DTI data, indicating the requirement for new and unbiased datasets. We hope this study will aid researchers in designing robust and high-performing data-driven DTI prediction systems that have real-world translational value in drug discovery.
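One way to picture a split that avoids the memorization problem in finding (i) is to link similar samples in a graph and assign whole connected components to either the training or the test fold, so that no test sample has a near-duplicate in training. The sketch below is a deliberately simplified stand-in for the network analysis-based strategy mentioned in the abstract, not a reproduction of it; the edge list and function names are hypothetical.

```python
def connected_components(n_items, edges):
    """Group item indices into connected components of an undirected graph (union-find)."""
    parent = list(range(n_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def component_split(n_items, edges, test_fraction=0.3):
    """Assign whole components to the test fold while it stays within the target size."""
    train, test = [], []
    target = test_fraction * n_items
    # Visit the smallest components first so the test fold fills with whole small clusters
    for component in sorted(connected_components(n_items, edges), key=len):
        if len(test) + len(component) <= target:
            test.extend(component)
        else:
            train.extend(component)
    return train, test

# Toy similarity graph: an edge links two samples considered "similar"
edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 8)]
train_idx, test_idx = component_split(10, edges, test_fraction=0.3)
```

By construction no similarity edge crosses the train/test boundary, which is the property a random split cannot guarantee.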

Correction: Reconstruction of lossless molecular representations from fingerprints

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-07-26 , DOI: 10.1186/s13321-023-00739-3

Umit V. Ucak, Islambek Ashyrmamatov, Juyong Lee

Correction: Journal of Cheminformatics (2023) 15:26, http://doi.org/10.1186/s13321-023-00693-0

Following publication of the original article [1], the authors requested to correct the funding number NRF-2019M3E5D4066898 to NRF-2022M3E5F3081268.

Funding: This work was supported by National Research Foundation of Korea (NRF) Grants funded by the Korean government (MSIT) (Nos. NRF-2022M3E5F3081268, NRF-2022R1C1C1005080 and NRF-2020M3A9G7103933 to I.A. and J.L.). This work was also supported by the Korea Environment Industry & Technology Institute (KEITI) through the Technology Development Project for Safety Management of Household Chemical Products, funded by the Korea Ministry of Environment (MOE) (KEITI:2020002960002 and NTIS:1485017120 to U.V.U. and J.L.).

The original article has been corrected.

[1] Ucak UV, Ashyrmamatov I, Lee J (2023) Reconstruction of lossless molecular representations from fingerprints. J Cheminform 15:26. http://doi.org/10.1186/s13321-023-00693-0

Authors and affiliations: Umit V. Ucak and Juyong Lee, Research Institute of Pharmaceutical Science, College of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea; Islambek Ashyrmamatov, Department of Chemistry, Kangwon National University, Chuncheon, 24341, Republic of Korea; Juyong Lee, Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea. Correspondence to Juyong Lee.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Cite this article: Ucak, U.V., Ashyrmamatov, I. & Lee, J. Correction: Reconstruction of lossless molecular representations from fingerprints. J Cheminform 15, 68 (2023). http://doi.org/10.1186/s13321-023-00739-3

Generative model based on junction tree variational autoencoder for HOMO value prediction and molecular optimization

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-02 , DOI: 10.1186/s13321-023-00681-4

Vladimir Kondratyev, Marian Dryzhakov, Timur Gimadiev, Dmitriy Slutskiy

In this work, we provide further development of the junction tree variational autoencoder (JT VAE) architecture in terms of implementation and application of the internal feature space of the model. Pretraining of JT VAE on a large dataset and further optimization with a regression model led to a latent space that can solve several tasks simultaneously: prediction, generation, and optimization. We use the ZINC database as a source of molecules for the JT VAE pretraining and the QM9 dataset with its HOMO values to show the application case. We evaluate our model on multiple tasks such as property (value) prediction, generation of new molecules with predefined properties, and structure modification toward the property. Across these tasks, our model shows improvements in generation and optimization tasks while preserving the precision of state-of-the-art models.

Force field-inspired molecular representation learning for property prediction

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-06 , DOI: 10.1186/s13321-023-00691-2

Gao-Peng Ren, Yi-Jian Yin, Ke-Jun Wu, Yuchen He

Molecular representation learning is a crucial task to accelerate drug discovery and materials design. Graph neural networks (GNNs) have emerged as a promising approach to tackle this task. However, most of them do not fully consider the intramolecular interactions, i.e. bond stretching, angle bending, torsion, and nonbonded interactions, which are critical for determining molecular property. Recently, a growing number of 3D-aware GNNs have been proposed to cope with the issue, while these models usually need large datasets and accurate spatial information. In this work, we aim to design a GNN which is less dependent on the quantity and quality of datasets. To this end, we propose a force field-inspired neural network (FFiNet), which can include all the interactions by incorporating the functional form of the potential energy of molecules. Experiments show that FFiNet achieves state-of-the-art performance on various molecular property datasets including both small molecules and large protein–ligand complexes, even on those datasets which are relatively small and without accurate spatial information. Moreover, the visualization for FFiNet indicates that it automatically learns the relationship between property and structure, which can promote an in-depth understanding of molecular structure.
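For orientation, the intramolecular and nonbonded terms listed above are commonly written in classical force fields in a form like the one below. This is the generic textbook (AMBER-style) functional form, given here only as background; the abstract does not specify the exact expression that FFiNet incorporates, and all symbols are the conventional ones rather than notation from the paper.

$$
E = \sum_{\text{bonds}} k_b (r - r_0)^2
  + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
  + \sum_{\text{torsions}} \frac{V_n}{2} \left[ 1 + \cos(n\phi - \gamma) \right]
  + \sum_{i<j} \left[ 4\varepsilon_{ij} \left( \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right) + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right]
$$

Here $$r$$, $$\theta$$ and $$\phi$$ are bond lengths, bond angles and torsion angles, $$r_{ij}$$ is the distance between nonbonded atoms $$i$$ and $$j$$, and the remaining symbols are fitted parameters and partial charges. The four sums correspond, in order, to bond stretching, angle bending, torsion, and nonbonded (Lennard-Jones plus Coulomb) interactions.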

In-silico target prediction by ensemble chemogenomic model based on multi-scale information of chemical structures and protein sequences

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-04-23 , DOI: 10.1186/s13321-023-00720-0

Su-Qing Yang, Liu-Xia Zhang, You-Jin Ge, Jin-Wei Zhang, Jian-Xin Hu, Cheng-Ying Shen, Ai-Ping Lu, Ting-Jun Hou, Dong-Sheng Cao

Identification and validation of bioactive small-molecule targets is a significant challenge in drug discovery. In recent years, various in-silico approaches have been proposed to expedite time- and resource-consuming experiments for target detection. Herein, we developed several chemogenomic models for target prediction based on multi-scale information of chemical structures and protein sequences. By combining the information of a compound with multiple protein targets together and putting these compound-target pairs into a well-established model, the scores to indicate whether there are interactions between compounds and targets can be derived, and thus a target prediction task can be completed by sorting the outputted scores. To improve the prediction performance, we constructed several chemogenomic models using multi-scale information of chemical structures and protein sequences, and the ensemble model with the best performance was used as our final model. The model was validated by various strategies and external datasets and the promising target prediction capability of the model, i.e., the fraction of known targets identified in the top-k (1 to 10) list of the potential target candidates suggested by the model, was confirmed. Compared with multiple state-of-the-art target prediction methods, our model showed equivalent or better predictive ability in terms of the top-k predictions. It is expected that our method can be utilized as a powerful computational tool to narrow down the potential targets for experimental testing.
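The top-k evaluation described above can be sketched as follows: for each compound, targets are sorted by predicted score, and a known target counts as recovered if it appears among the k best. This is one plausible reading of "the fraction of known targets identified in the top-k list"; the function name, the score dictionary layout, and the toy values are assumptions for illustration.

```python
def top_k_recall(scores, known, k):
    """Fraction of known compound-target pairs whose target ranks in the compound's top k."""
    hits = 0
    total = 0
    for compound, target_scores in scores.items():
        # Rank candidate targets for this compound by descending predicted score
        ranked = sorted(target_scores, key=target_scores.get, reverse=True)[:k]
        for target in known.get(compound, ()):
            total += 1
            hits += target in ranked
    return hits / total if total else 0.0

# Hypothetical interaction scores for two compounds against three targets
scores = {
    "c1": {"T1": 0.9, "T2": 0.4, "T3": 0.1},
    "c2": {"T1": 0.2, "T2": 0.3, "T3": 0.8},
}
known = {"c1": {"T1"}, "c2": {"T2"}}

print(top_k_recall(scores, known, k=1))  # 0.5
print(top_k_recall(scores, known, k=2))  # 1.0
```

Reporting the metric over a range of k (1 to 10 in the abstract) shows how quickly the known targets are recovered as the candidate list grows.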

GlyLES: Grammar-based Parsing of Glycans from IUPAC-condensed to SMILES

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-03-23 , DOI: 10.1186/s13321-023-00704-0

Roman Joeres, Daniel Bojar, Olga V Kalinina

Glycans are important polysaccharides on cellular surfaces that are bound to glycoproteins and glycolipids. These are one of the most common post-translational modifications of proteins in eukaryotic cells. They play important roles in protein folding, cell-cell interactions, and other extracellular processes. Changes in glycan structures may influence the course of different diseases, such as infections or cancer. Glycans are commonly represented using the IUPAC-condensed notation. IUPAC-condensed is a textual representation of glycans operating on the same topological level as the Symbol Nomenclature for Glycans (SNFG) that assigns colored, geometrical shapes to the main monomers. These symbols are then connected in tree-like structures, visualizing the glycan structure on a topological level. Yet for a representation on the atomic level, notations such as SMILES should be used. To our knowledge, there is no easy-to-use, general, open-source, and offline tool to convert the IUPAC-condensed notation to SMILES. Here, we present the open-access Python package GlyLES for the generalizable generation of SMILES representations out of IUPAC-condensed representations. GlyLES uses a grammar to read in the monomer tree from the IUPAC-condensed notation. From this tree, the tool can compute the atomic structures of each monomer based on their IUPAC-condensed descriptions. In the last step, it merges all monomers into the atomic structure of a glycan in the SMILES notation. GlyLES is the first package that allows conversion from the IUPAC-condensed notation of glycans to SMILES strings. This may have multiple applications, including straightforward visualization, substructure search, molecular modeling and docking, and a new featurization strategy for machine-learning algorithms. GlyLES is available at http://github.com/kalininalab/GlyLES .

Double-head transformer neural network for molecular property prediction

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-23 , DOI: 10.1186/s13321-023-00700-4

Yuanbing Song, Jinghua Chen, Wenju Wang, Gang Chen, Zhichong Ma

Existing deep-learning methods for molecular property prediction neglect both the generalization ability of the nonlinear representation of molecular features and the appropriate weighting of those features, which makes it difficult to improve prediction accuracy further. To address these problems, this paper proposes an end-to-end double-head transformer neural network (DHTNN) for high-precision molecular property prediction. To suit the data distribution of molecular datasets, DHTNN introduces a new activation function, beaf, which can greatly improve the generalization ability of the nonlinear representation of molecular features. A residual network is introduced in the molecular encoding part to solve the gradient explosion problem and ensure that the model converges quickly. A transformer based on double-head attention extracts intrinsic molecular detail features, and weights are assigned appropriately to predict molecular properties with high accuracy. Our model, which was tested on the MoleculeNet [1] benchmark dataset, showed significant performance improvements over other state-of-the-art methods.

DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning

Journal of Cheminformatics ( IF 8.489 ) Pub Date : 2023-02-20 , DOI: 10.1186/s13321-023-00694-z

Xuhan Liu, Kai Ye, Herman W. T. van Vlijmen, Adriaan P. IJzerman, Gerard J. P. van Westen

Because the drug-like chemical space available to search for novel molecules is so large, rational drug design often starts from specific scaffolds to which side chains or substituents are added or modified. With the rapid growth of deep learning in drug discovery, a variety of effective approaches have been developed for de novo drug design. In previous work we proposed a method named DrugEx, which can be applied in polypharmacology based on multi-objective deep reinforcement learning. However, the previous version is trained under fixed objectives and does not allow users to input any prior information (e.g. a desired scaffold). To improve the general applicability, we updated DrugEx to design drug molecules based on scaffolds which consist of multiple fragments provided by users. Here, a Transformer model was employed to generate molecular structures. The Transformer is a multi-head self-attention deep learning model containing an encoder to receive scaffolds as input and a decoder to generate molecules as output. To handle the graph representation of molecules, a novel positional encoding for each atom and bond based on an adjacency matrix was proposed, extending the architecture of the Transformer. The graph Transformer model contains growing and connecting procedures for molecule generation starting from a given scaffold based on fragments. Moreover, the generator was trained under a reinforcement learning framework to increase the number of desired ligands. As a proof of concept, the method was applied to design ligands for the adenosine A2A receptor (A2AAR) and compared with SMILES-based methods. The results show that 100% of the generated molecules were valid and most of them had a high predicted affinity value towards A2AAR with given scaffolds.

Chinese Academy of Sciences (CAS) SCI journal ranking

Major category	Subcategory	TOP	Review
Chemistry, Tier 2	CHEMISTRY, MULTIDISCIPLINARY, Tier 3	No	No

Supplementary information

Self-citation rate	H-index	SCI indexing status	PubMed Central (PML)
8.30	30	Science Citation Index Expanded	No

Submission guidelines

Submission website: http://www.editorialmanager.com/CHIN/default.aspx
Author guidelines: http://jcheminf.biomedcentral.com/submission-guidelines
Manuscript template: http://jcheminf.biomedcentral.com/submission-guidelines/preparing-your-manuscript
Reference format: http://jcheminf.biomedcentral.com/submission-guidelines/preparing-your-manuscript
Scope: All aspects of cheminformatics and molecular modelling; chemical information systems, software and databases, and molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques
Article types: Research article, Book report, Commentary, Database, Meeting report, Methodology, Preliminary communication, Review, Software, Educational, Letter to the Editor