We recently released a review of machine learning methods in protein engineering, but the field changes so quickly, and so many new papers appear, that any static document will inevitably miss important work. A living list also lets us broaden the scope beyond engineering-specific applications. We hope it will be a useful resource for people interested in the field.
To the best of our knowledge, this is the first public, collaborative list of machine learning papers on protein applications. We try to classify papers based on a combination of their applications and model type. If you have suggestions for other papers or categories, please make a pull request or issue!
Format
Within each category, papers are listed in reverse chronological order (newest first). Where possible, a link should be provided.
The road to fully programmable protein catalysis.
Sarah L. Lovelock, Rebecca Crawshaw, Sophie Basler, Colin Levy, David Baker, Donald Hilvert, Anthony P. Green. Nature, June 2022.
[10.1038/s41586-022-04456-z]
Applications of artificial intelligence to enzyme and pathway design for metabolic engineering.
Woo Dae Jang, Gi Bae Kim, Yeji Kim, Sang Yup Lee. Current Opinion in Biotechnology, February 2022.
[10.1016/j.copbio.2021.07.024]
Adaptive machine learning for protein engineering.
Brian L. Hie, Kevin K. Yang. Current Opinion in Structural Biology, February 2022.
[10.1016/j.sbi.2021.11.002]
Protein sequence design with deep generative models.
Zachary Wu, Kadina E. Johnston, Frances H. Arnold, Kevin K. Yang. Current Opinion in Chemical Biology, December 2021.
[10.1016/j.cbpa.2021.04.004]
AI challenges for predicting the impact of mutations on protein stability.
Fabrizio Pucci, Martin Schwersensky, Marianne Rooman. Preprint, November 2021.
[arxiv]
Advances in machine learning for directed evolution.
Bruce J Wittmann, Kadina E Johnston, Zachary Wu, Frances H Arnold. Current Opinion in Structural Biology, August 2021.
[10.1016/j.sbi.2021.01.008]
A Brief Review of Machine Learning Techniques for Protein Phosphorylation Sites Prediction.
Farzaneh Esmaili, Mahdi Pourmirzaei, Shahin Ramazi, Elham Yavari. Preprint, August 2021.
[arxiv]
Learning the protein language: Evolution, structure, and function.
Tristan Bepler, Bonnie Berger. Cell Systems, June 2021.
[10.1016/j.cels.2021.05.017]
Data-driven computational protein design.
Vincent Frappier, Amy E. Keating. Current Opinion in Structural Biology, May 2021.
[10.1016/j.sbi.2021.03.009]
Machine learning in protein structure prediction.
Mohammed AlQuraishi. Current Opinion in Chemical Biology, May 2021.
[10.1016/j.cbpa.2021.04.005]
Protein sequence-to-structure learning: Is this the end(-to-end revolution)?
Elodie Laine, Stephan Eismann, Arne Elofsson, Sergei Grudinin. Preprint, May 2021.
[arxiv]
Revolutionizing enzyme engineering through artificial intelligence and machine learning.
Nitu Singh, Sunny Malik, Anvita Gupta, Kinshuk Raj Srivastava. Emerging Topics in Life Sciences, April 2021.
[10.1042/ETLS20200257]
The language of proteins: NLP, machine learning & protein sequences.
Dan Ofer, Nadav Brandes, Michal Linial. Computational and Structural Biotechnology Journal, January 2021.
[10.1016/j.csbj.2021.03.022]
Machine learning and AI-based approaches for bioactive ligand discovery and GPCR-ligand recognition.
Sebastian Raschka, Benjamin Kaufman. Preprint, January 2020.
[arxiv]
Machine Learning in Enzyme Engineering.
Stanislav Mazurenko, Zbynek Prokop, Jiri Damborsky. ACS Catalysis, December 2019.
[10.1021/acscatal.9b04321]
Machine learning-guided directed evolution for protein engineering.
Kevin K. Yang, Zachary Wu, Frances H. Arnold. Nature Methods, July 2019.
[10.1038/s41592-019-0496-6]
Preprint available on arxiv.
Evaluating Protein Transfer Learning with TAPE.
Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song. Preprint, June 2019.
[arxiv]
Can Machine Learning Revolutionize Directed Evolution of Selective Enzymes?
Guangyue Li, Yijie Dong, Manfred T. Reetz. Advanced Synthesis & Catalysis, March 2019.
[10.1002/adsc.201900149]
Tools
PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding.
Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang, Chang Ma, Runcheng Liu, Jian Tang. Preprint, June 2022.
[arxiv]
Randomized gates eliminate bias in sort-seq assays.
Brian L. Trippe, Buwei Huang, Erika A. DeBenedictis, Brian Coventry, Nicholas Bhattacharya, Kevin K. Yang, David Baker, Lorin Crawford. Preprint, February 2022.
[10.1101/2022.02.17.480881]
FLIP: Benchmark tasks in fitness landscape inference for proteins.
Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, Kevin K. Yang. NeurIPS 2021 Datasets and Benchmarks Track, December 2021.
[10.1101/2021.11.09.467890]
evSeq: Cost-Effective Amplicon Sequencing of Every Variant in a Protein Library.
Bruce J. Wittmann, Kadina E. Johnston, Patrick J. Almhjell, Frances H. Arnold. Preprint, November 2021.
[10.1101/2021.11.18.469179]
The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires.
Milena Pavlović, Lonneke Scheffer, Keshav Motwani, Chakravarthi Kanduri, Radmila Kompova, Nikolay Vazov, Knut Waagan, Fabian L. M. Bernal, Alexandre Almeida Costa, Brian Corrie, Rahmad Akbar, Ghadi S. Al Hajj, Gabriel Balaban, Todd M. Brusko, Maria Chernigovskaya, Scott Christley, Lindsay G. Cowell, Robert Frank, Ivar Grytten, Sveinung Gundersen, Ingrid Hobæk Haff, Eivind Hovig, Ping-Han Hsieh, Günter Klambauer, Marieke L. Kuijjer, Christin Lund-Andersen, Antonio Martini, Thomas Minotto, Johan Pensar, Knut Rand, Enrico Riccardi, Philippe A. Robert, Artur Rocha, Andrei Slabodkin, Igor Snapkov, Ludvig M. Sollid, Dmytro Titov, Cédric R. Weber, Michael Widrich, Gur Yaari, Victor Greiff & Geir Kjetil Sandve. Nature Machine Intelligence, November 2021.
[10.1038/s42256-021-00413-z]
Learned embeddings from deep learning to visualize and predict protein sets.
Christian Dallago, Konstantin Schütze, Michael Heinzinger, Tobias Olenyi, Maria Littmann, Amy X Lu, Kevin K Yang, Seonwoo Min, Sungroh Yoon, James T Morton, Burkhard Rost. Current Protocols, May 2021.
[10.1002/cpz1.113]
Population-Based Black-Box Optimization for Biological Sequence Design.
Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, David Dohan, Kevin Murphy, Lucy Colwell, D Sculley. ICML, July 2020.
[ICML]
Selene: a PyTorch-based deep learning library for sequence data.
Kathleen M. Chen, Evan M. Cofer, Jian Zhou, Olga G. Troyanskaya. Nature Methods, March 2019.
[10.1038/s41592-019-0360-8]
Machine-learning guided directed evolution
Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space.
Emily K. Makowski, Patrick C. Kinnunen, Jie Huang, Lina Wu, Matthew D. Smith, Tiexin Wang, Alec A. Desai, Craig N. Streu, Yulei Zhang, Jennifer M. Zupancic, John S. Schardt, Jennifer J. Linderman, Peter M. Tessier. Nature Communications, July 2022.
[10.1038/s41467-022-31457-3]
Heterogeneity of the GFP fitness landscape and data-driven protein design.
Louisa Gonzalez Somermeyer, Aubin Fleiss, Alexander S Mishin, Nina G Bozhanova, Anna A Igolkina, Jens Meiler, Maria-Elisenda Alaball Pujol, Ekaterina V Putintseva, Karen S Sarkisyan. eLife, May 2022.
[10.7554/eLife.75842]
De novo protein design by deep network hallucination.
Ivan Anishchenko, Samuel J. Pellock, Tamuka M. Chidyausiku, Theresa A. Ramelot, Sergey Ovchinnikov, Jingzhou Hao, Khushboo Bafna, Christoffer Norn, Alex Kang, Asim K. Bera, Frank DiMaio, Lauren Carter, Cameron M. Chow, Gaetano T. Montelione & David Baker. Nature, December 2021.
[10.1038/s41586-021-04184-w]
Informed training set design enables efficient machine learning-assisted directed protein evolution.
Bruce J. Wittmann, Yisong Yue, Frances H. Arnold. Cell Systems, November 2021.
[10.1016/j.cels.2021.07.008]
Machine learning-based library design improves packaging and diversity of adeno-associated virus (AAV) libraries.
Danqing Zhu, David H. Brookes, Akosua Busia, Ana Carneiro, Clara Fannjiang, Galina Popova, David Shin, Edward F. Chang, Tomasz J. Nowakowski, Jennifer Listgarten, David V. Schaffer.
[10.1101/2021.11.02.467003]
Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models.
Eli N. Weinstein, Alan N. Amin, Will Grathwohl, Daniel Kassler, Jean Disset, Debora S. Marks. Preprint, October 2021.
[10.1101/2021.10.28.466307]
Unifying Likelihood-free Inference with Black-box Sequence Design and Beyond.
Dinghuai Zhang, Jie Fu, Yoshua Bengio, Aaron Courville. Preprint, October 2021.
[arxiv]
Conservative Objective Models for Effective Offline Model-Based Optimization.
Brandon Trabucco, Aviral Kumar, Xinyang Geng, Sergey Levine. Preprint, July 2021.
[arxiv]
Deep Extrapolation for Attribute-Enhanced Generation.
Alvin Chan, Ali Madani, Ben Krause, Nikhil Naik. Preprint, July 2021.
[arxiv]
Effective Surrogate Models for Protein Design with Bayesian Optimization.
Nate Gruver, Samuel Stanton, Polina Kirichenko, Marc Finzi, Phillip Maffettone, Vivek Myers, Emily Delaney, Peyton Greenside, Andrew Gordon Wilson. 2021 ICML Workshop on Computational Biology, July 2021.
[pdf]
Bayesian optimization with evolutionary and structure-based regularization for directed protein evolution.
Trevor S. Frisby, Christopher James Langmead. Algorithms for Molecular Biology, July 2021.
[10.1186/s13015-021-00195-4]
Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design.
Adam Foster, Desi R. Ivanova, Ilyas Malik, Tom Rainforth. Preprint, July 2021.
[arxiv]
In silico proof of principle of machine learning-based antibody design at unconstrained scale.
Rahmad Akbar, Philippe A. Robert, Cédric R. Weber, Michael Widrich, Robert Frank, Milena Pavlović, Lonneke Scheffer, Maria Chernigovskaya, Igor Snapkov, Andrei Slabodkin, Brij Bhushan Mehta, Enkelejda Miho, Fridtjof Lund-Johansen, Jan Terje Andersen, Sepp Hochreiter, Ingrid Hobæk Haff, Günter Klambauer, Geir Kjetil Sandve, Victor Greiff. Preprint, July 2021.
[10.1101/2021.07.08.451480]
Deep diversification of an AAV capsid protein by machine learning.
Drew H. Bryant, Ali Bashir, Sam Sinai, Nina K. Jain, Pierce J. Ogden, Patrick F. Riley, George M. Church, Lucy J. Colwell & Eric D. Kelsic. Nature Biotechnology, February 2021.
[10.1038/s41587-020-00793-4]
Deep Uncertainty and the Search for Proteins.
Zelda Mariet, Ghassen Jerfel, Zi Wang, Christof Angermüller, David Belanger, Suhani Vora, Maxwell Bileschi, Lucy Colwell, D Sculley, Dustin Tran, Jasper Snoek. NeurIPS 2020 ML for Molecules Workshop, December 2020.
[pdf]
Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production.
Jonathan C. Greenhalgh, Sarah A. Fahlberg, Brian F. Pfleger, Philip A. Romero. Preprint, May 2021.
[10.1101/2021.05.21.445192]
Large-scale design and refinement of stable proteins using sequence-only models.
Jedediah M. Singer, Scott Novotney, Devin Strickland, Hugh K. Haddox, Nicholas Leiby, Gabriel J. Rocklin, Cameron M. Chow, Anindya Roy, Asim K. Bera, Francis C. Motta, … Eric Klavins. Preprint, March 2021.
[10.1101/2021.03.12.435185]
AdaLead: A simple and robust adaptive greedy search algorithm for sequence design.
Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, Eric D. Kelsic. Preprint, October 2020.
[arxiv]
The NK Landscape as a Versatile Benchmark for Machine Learning Driven Protein Engineering.
Adam C. Mater, Mahakaran Sandhu, Colin Jackson. Preprint, October 2020.
[10.1101/2020.09.30.319780]
Learning with uncertainty for biological discovery and design.
Brian Hie, Bryan Bryson, Bonnie Berger. Preprint, August 2020.
[10.1101/2020.08.11.247072]
Population-Based Black-Box Optimization for Biological Sequence Design.
Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, David Dohan, Kevin Murphy, Lucy Colwell, D Sculley. ICML, July 2020.
[ICML]
Autofocused oracles for model-based design.
Clara Fannjiang, Jennifer Listgarten. Preprint, June 2020.
[arxiv]
Domain Extrapolation via Regret Minimization.
Wengong Jin, Regina Barzilay, Tommi Jaakkola. Preprint, June 2020.
[arxiv]
Fast differentiable DNA and protein sequence optimization for molecular design.
Johannes Linder, Georg Seelig. Preprint, May 2020.
[arxiv]
A Deep Dive into Machine Learning Models for Protein Engineering.
Yuting Xu, Deeptak Verma, Robert P Sheridan, Andy Liaw, Junshui Ma, Nicholas Marshall, John McIntosh, Edward C. Sherer, Vladimir Svetnik, Jennifer Johnston. Journal of Chemical Information and Modeling, April 2020.
[10.1021/acs.jcim.0c00073]
Evolutionary context-integrated deep sequence modeling for protein engineering.
Yunan Luo, Lam Vo, Hantian Ding, Yufeng Su, Yang Liu, Wesley Wei Qian, Huimin Zhao, Jian Peng. Preprint, January 2020.
[10.1101/2020.01.16.908509]
Biological Sequence Design using Batched Bayesian Optimization.
David Belanger, Suhani Vora, Zelda Mariet, Ramya Deshpande, David Dohan, Christof Angermueller, Kevin Murphy, Olivier Chapelle, Lucy Colwell. NeurIPS Workshop on Machine Learning and the Physical Sciences, December 2019.
[ML4PS]
Model Inversion Networks for Model-Based Optimization.
Aviral Kumar, Sergey Levine. Preprint, December 2019.
[arxiv]
Interpreting mutational effects predictions, one substitution at a time.
C. K. Sruthi, Meher K. Prakash. bioRxiv, December 2019.
[10.1101/867812]
A structure-based deep learning framework for protein engineering.
Raghav Shroff, Austin W. Cole, Barrett R. Morrow, Daniel J. Diaz, Isaac Donnell, Jimmy Gollihar, Andrew D. Ellington, Ross Thyer. Preprint, November 2019.
[10.1101/833905]
Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design.
Pierce J. Ogden, Eric D. Kelsic, Sam Sinai, George M. Church. Science, November 2019.
[10.1126/science.aaw2900]
Machine learning-guided channelrhodopsin engineering enables minimally-invasive optogenetics.
Claire N. Bedbrook, Kevin K. Yang, J. Elliott Robinson, Viviana Gradinaru, Frances H Arnold. Nature Methods, October 2019.
[10.1038/s41592-019-0583-8]
Preprint available on [bioRxiv]
Batched Stochastic Bayesian Optimization via Combinatorial Constraints Design.
Kevin K. Yang, Yuxin Chen, Alycia Lee, Yisong Yue. International Conference on Artificial Intelligence and Statistics (AISTATS), April 2019.
[arxiv] [PMLR]
Machine learning-assisted directed protein evolution with combinatorial libraries.
Zachary Wu, S. B. Jennifer Kan, Russell D. Lewis, Bruce J. Wittmann, Frances H. Arnold. PNAS, April 2019.
[10.1073/pnas.1901979116]
Conditioning by adaptive sampling for robust design.
David H. Brookes, Hahnbeom Park, Jennifer Listgarten. Preprint, January 2019.
[arxiv]
A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes.
Frédéric Cadet, Nicolas Fontaine, Guangyue Li, Joaquin Sanchis, Matthieu Ng Fuk Chong, Rudy Pandjaitan, Iyanar Vetrivel, Bernard Offmann, Manfred T. Reetz. Scientific Reports, November 2018.
[10.1038/s41598-018-35033-y]
Design by adaptive sampling.
David H. Brookes, Jennifer Listgarten. Preprint, October 2018.
[arxiv]
Machine-Learning-Guided Mutagenesis for Directed Evolution of Fluorescent Proteins.
Yutaka Saito, Misaki Oikawa, Hikaru Nakazawa, Teppei Niide, Tomoshi Kameda, Koji Tsuda, and Mitsuo Umetsu. ACS Synthetic Biology, August 2018.
[10.1021/acssynbio.8b00155]
Toward machine-guided design of proteins.
Surojit Biswas, Gleb Kuznetsov, Pierce J. Ogden, Nicholas J. Conway, Ryan P. Adams, George M. Church. Preprint, June 2018.
[10.1101/337154] [bioRxiv]
Feedback GAN (FBGAN) for DNA: a Novel Feedback-Loop Architecture for Optimizing Protein Functions.
Anvita Gupta, James Zou. Preprint, April 2018.
[arxiv]
Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization.
Claire N. Bedbrook, Kevin K. Yang, Austin J. Rice, Viviana Gradinaru, Frances H. Arnold. PLOS Computational Biology, October 2017.
[10.1371/journal.pcbi.1005786]
Exploring sequence-function space of a poplar glutathione transferase using designed information-rich gene variants.
Yaman Musdal, Sridhar Govindarajan, Bengt Mannervik. Protein Engineering, Design, and Selection, August 2017.
[10.1093/protein/gzx045]
Navigating the protein fitness landscape with Gaussian processes.
Philip A. Romero, Andreas Krause, Frances H. Arnold. PNAS, January 2013.
[10.1073/pnas.1215251110]
Engineering proteinase K using machine learning and synthetic genes.
Jun Liao, Manfred K. Warmuth, Sridhar Govindarajan, Jon E. Ness, Rebecca P Wang, Claes Gustafsson, Jeremy Minshull. BMC Biotechnology, March 2007.
[10.1186/1472-6750-7-16]
Improving catalytic function by ProSAR-driven enzyme evolution.
Richard J. Fox, S. Christopher Davis, Emily C. Mundorff, Lisa M. Newman, Vesna Gavrilovic, Steven K. Ma, Loleta M. Chung, Charlene Ching, Sarena Tam, Sheela Muley, John Grate, John Gruber, John C. Whitman, Roger A. Sheldon, Gjalt W. Huisman. Nature Biotechnology, February 2007.
[Nature Biotechnology]
Representation learning
Language models of protein sequences at the scale of evolution enable accurate structure prediction.
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. Preprint, July 2022.
[10.1101/2022.07.20.500902]
Advancing protein language models with linguistics: a roadmap for improved interpretability.
Mai Ha Vu, Rahmad Akbar, Philippe A. Robert, Bartlomiej Swiatczak, Victor Greiff, Geir Kjetil Sandve, Dag Trygve Truslew Haug. Preprint, July 2022.
[arxiv]
Self-supervised deep learning encodes high-resolution features of protein subcellular localization.
Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti & Loic A. Royer. Nature Methods, July 2022.
[10.1038/s41592-022-01541-z]
COLLAPSE: A representation learning framework for identification and characterization of protein structural sites.
Alexander Derry, Russ B. Altman. Preprint, July 2022.
[10.1101/2022.07.20.500713]
CoSP: Co-supervised pretraining of pocket and ligand.
Zhangyang Gao, Cheng Tan, Lirong Wu, Stan Z. Li. Preprint, June 2022.
[arxiv]
Pre-training Protein Models with Molecular Dynamics Simulations for Drug Binding.
Wu F, Zhang Q, Radev D, Wang Y, Jin X, Jiang Y, Li SZ, Niu Z. Preprint, June 2022.
[10.21203/rs.3.rs-1566483/v1]
Exploring evolution-based &-free protein language models as protein function predictors.
Mingyang Hu, Fajie Yuan, Kevin K. Yang, Fusong Ju, Jin Su, Hui Wang, Fei Yang, Qiuyang Ding. Preprint, June 2022.
[arxiv]
Masked inverse folding with sequence transfer for protein representation learning.
Kevin K. Yang, Niccolò Zanichelli, Hugh Yeh. Preprint, June 2022.
[10.1101/2022.05.25.493516]
Convolutions are competitive with transformers for protein sequence pretraining.
Kevin K. Yang, Alex X. Lu, Nicolo Fusi. Preprint, June 2022.
[10.1101/2022.05.19.492714]
Evolutionary velocity with protein language models.
Brian L. Hie, Kevin K. Yang, Peter S. Kim. Cell Systems, April 2022.
[10.1016/j.cels.2022.01.003]
Identification of Enzymatic Active Sites with Unsupervised Language Modeling.
Loïc Kwate Dassi, Matteo Manica, Daniel Probst, Philippe Schwaller, Yves Gaetan Nana Teukam, Teodoro Laino. Preprint, November 2021.
[10.33774/chemrxiv-2021-m20gg]
Artificial Intelligence Guided Conformational Mining of Intrinsically Disordered Proteins.
Aayush Gupta, Souvik Dey, Huan-Xiang Zhou. Preprint, November 2021.
[10.1101/2021.11.21.469457]
Deciphering the language of antibodies using self-supervised learning.
Jinwoo Leem, Laura S. Mitchell, James H.R. Farmery, Justin Barton, Jacob D. Galson. Preprint, November 2021.
[10.1101/2021.11.10.468064]
Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model.
Liang He, Shizhuo Zhang, Lijun Wu, Huanhuan Xia, Fusong Ju, He Zhang, Siyuan Liu, Yingce Xia, Jianwei Zhu, Pan Deng, Bin Shao, Tao Qin, Tie-Yan Liu. Preprint, October 2021.
[arxiv]
Neural Distance Embeddings for Biological Sequences.
Gabriele Corso, Rex Ying, Michal Pándy, Petar Veličković, Jure Leskovec, Pietro Liò. Preprint, September 2021.
[arxiv]
Biologically relevant transfer learning improves transcription factor binding prediction.
Gherman Novakovsky, Manu Saraswat, Oriol Fornes, Sara Mostafavi & Wyeth W. Wasserman. Genome Biology, September 2021.
[10.1186/s13059-021-02499-5]
Toward More General Embeddings for Protein Design: Harnessing Joint Representations of Sequence and Structure.
Sanaa Mansoor, Minkyung Baek, Umesh Madan, Eric Horvitz. Preprint, September 2021.
[10.1101/2021.09.01.458592]
Hydrogen bonds meet self-attention: all you need for general-purpose protein structure embedding.
Cheng Chen, Yuguo Zha, Daming Zhu, Kang Ning, Xuefeng Cui. Preprint, August 2021.
[10.1101/2021.01.31.428935]
Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning.
Alex X Lu, Amy X Lu, Iva Pritišanac, Taraneh Zarin, Julie D Forman-Kay, Alan M Moses. Preprint, July 2021.
[10.1101/2021.07.29.454330]
Inferring a Continuous Distribution of Atom Coordinates from Cryo-EM Images using VAEs.
Dan Rosenbaum, Marta Garnelo, Michal Zielinski, Charlie Beattie, Ellen Clancy, Andrea Huber, Pushmeet Kohli, Andrew W. Senior, John Jumper, Carl Doersch, S. M. Ali Eslami, Olaf Ronneberger, Jonas Adler. Preprint, June 2021.
[arxiv]
Pretraining model for biological sequence data.
Bosheng Song, Zimeng Li, Xuan Lin, Jianmin Wang, Tian Wang, Xiangzheng Fu. Briefings in Functional Genomics, May 2021.
[10.1093/bfgp/elab025]
ProteinBERT: A universal deep-learning model of protein sequence and function.
Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial. Preprint, May 2021.
[10.1101/2021.05.24.445464]
Random Embeddings and Linear Regression can Predict Protein Function.
Tianyu Lu, Alex X. Lu, Alan M. Moses. Preprint, April 2021.
[arxiv]
Combining evolutionary and assay-labelled data for protein fitness prediction.
Chloe Hsu, Hunter Nisonoff, Clara Fannjiang, Jennifer Listgarten. Preprint, March 2021.
[10.1101/2021.03.28.437402]
MSA Transformer.
Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu, Alexander Rives. Preprint, February 2021.
[10.1101/2021.02.12.430858]
Improving Generalizability of Protein Sequence Models with Data Augmentations.
Hongyu Shen, Layne C. Price, Taha Bahadori, Franziska Seeger. Preprint, February 2021.
[10.1101/2021.02.18.431877]
Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures.
Damianos P. Melidis, Wolfgang Nejdl. Algorithms, January 2021.
[10.3390/a14010028]
Adversarial Contrastive Pre-training for Protein Sequences.
Matthew B. A. McDermott, Brendan Yap, Harry Hsu, Di Jin, Peter Szolovits. Preprint, January 2021.
[arxiv]
Fast end-to-end learning on protein surfaces.
Freyr Sverrisson, Jean Feydy, Bruno E. Correia, Michael M. Bronstein. Preprint, December 2020.
[10.1101/2020.12.28.424589]
Transformer protein language models are unsupervised structure learners.
Roshan Rao, Sergey Ovchinnikov, Joshua Meier, Alexander Rives, Tom Sercu. Preprint, December 2020.
[10.1101/2020.12.15.422761]
Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep): Protein Engineering as A Case Study.
Junwen Luo, Yi Cai, Jialin Wu, Hongmin Cai, Xiaofeng Yang, Zhanglin Lin. Preprint, December 2020.
[10.1101/2020.12.22.423916]
What is a meaningful representation of protein sequences?
Nicki Skafte Detlefsen, Søren Hauberg, Wouter Boomsma. Preprint, November 2020.
[arxiv]
Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models.
Pascal Sturmfels, Jesse Vig, Ali Madani, Nazneen Fatema Rajani. Preprint, November 2020.
[arxiv]
Fixed-Length Protein Embeddings using Contextual Lenses.
Amir Shanehsazzadeh, David Belanger, David Dohan. Preprint, October 2020.
[arxiv]
Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis.
Serbulent Unsal, Heval Ataş, Muammer Albayrak, Kemal Turhan, Aybar C. Acar, Tunca Doğan. Preprint, October 2020.
[10.1101/2020.10.28.359828]
Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization.
Amy X. Lu, Haoran Zhang, Marzyeh Ghassemi, Alan Moses. Preprint, September 2020.
[10.1101/2020.09.04.283929]
ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost. Preprint, July 2020.
[10.1101/2020.07.12.199554]
Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function.
Amelia Villegas-Morcillo, Stavros Makrodimitris, Roeland van Ham, Angel M. Gomez, Victoria Sanchez, Marcel Reinders. Preprint, April 2020.
[10.1101/2020.04.07.028373]
Site2Vec: a reference frame invariant algorithm for vector embedding of protein-ligand binding sites.
Arnab Bhadra, Kalidas Y. Preprint, March 2020.
[arxiv]
Evolutionary context-integrated deep sequence modeling for protein engineering.
Yunan Luo, Lam Vo, Hantian Ding, Yufeng Su, Yang Liu, Wesley Wei Qian, Huimin Zhao, Jian Peng. Preprint, January 2020.
[10.1101/2020.01.16.908509]
Sequence representations and their utility for predicting protein-protein interactions.
Dhananjay Kimothi, Pravesh Biyani, James M Hogan. Preprint, December 2019.
[10.1101/2019.12.31.890699]
Language modelling for biological sequences – curated datasets and baselines.
Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Ole Winther, Henrik Nielsen. Preprint, December 2019.
[alrojo.github.io]
Deciphering protein evolution and fitness landscapes with latent space models.
Xinqiang Ding, Zhengting Zou, Charles L. Brooks III. Nature Communications, December 2019.
[10.1038/s41467-019-13633-0]
End-to-end multitask learning, from protein language to protein features without alignments.
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Burkhard Rost. Preprint, December 2019.
[10.1101/864405]
Unified rational protein engineering with sequence-only deep representation learning.
Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, George M. Church. Nature Methods, October 2019.
[10.1038/s41592-019-0598-1]
Structure-Based Function Prediction using Graph Convolutional Networks.
Vladimir Gligorijevic, P. Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Kyunghyun Cho, Tommi Vatanen, Daniel Berenberg, Bryn Taylor, Ian M. Fisk, Ramnik J. Xavier, Rob Knight, Richard Bonneau. Preprint, October 2019.
[10.1101/786236]
Modeling the language of life – Deep Learning Protein Sequences.
Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, Burkhard Rost. Preprint, September 2019.
[10.1101/614313]
Augmenting Protein Network Embeddings with Sequence Information.
Hassan Kane, Mohamed K. Coulibali, Pelkins Ajanoh, Ali Abdallah. Preprint, August 2019.
[10.1101/730481]
Universal Deep Sequence Models for Protein Classification.
Nils Strodthoff, Patrick Wagner, Markus Wenzel, Wojciech Samek. Preprint, July 2019.
[10.1101/704874]
DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences.
Ehsaneddin Asgari, Nina Poerner, Alice C. McHardy, Mohammad R.K. Mofrad. Preprint, July 2019.
[10.1101/705426]
A Self-Consistent Sonification Method to Translate Amino Acid Sequences into Musical Compositions and Application in Protein Design Using Artificial Intelligence.
Chi-Hua Yu, Zhao Qin, Francisco J. Martin-Martinez, Markus J. Buehler. ACS Nano, June 2019.
[10.1021/acsnano.9b02180]
Evaluating Protein Transfer Learning with TAPE.
Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song. Preprint, June 2019.
[arxiv]
Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins.
Julius Upmeier zu Belzen, Thore Bürgel, Stefan Holderbach, Felix Bubeck, Lukas Adam, Catharina Gandor, Marita Klein, Jan Mathony, Pauline Pfuderer, Lukas Platz, Moritz Przybilla, Max Schwendemann, Daniel Heid, Mareike Daniela Hoffmann, Michael Jendrusch, Carolin Schmelas, Max Waldhauer, Irina Lehmann, Dominik Niopek, Roland Eils. Nature Machine Intelligence, May 2019.
[Nature Machine Intelligence]
Modeling the Language of Life – Deep Learning Protein Sequences.
Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, Burkhard Rost. Preprint, May 2019.
[10.1101/614313] [bioRxiv]
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences.
Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus. Preprint, April 2019.
[10.1101/622803] [bioRxiv]
Learning protein constitutive motifs from sequence data.
Jérôme Tubiana, Simona Cocco, Rémi Monasson. eLife, March 2019.
[10.7554/eLife.39397]
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).
Ehsaneddin Asgari, Alice C. McHardy, Mohammad R. K. Mofrad. Scientific Reports, March 2019.
[10.1038/s41598-019-38746-w]
Learning protein sequence embeddings using information from structure.
Tristan Bepler, Bonnie Berger. International Conference on Learning Representations, February 2019.
[ICLR]
Application of Fourier transform and proteochemometrics principles to protein engineering.
Frédéric Cadet, Nicolas Fontaine, Iyanar Vetrivel, Matthieu Ng Fuk Chong, Olivier Savriama, Xavier Cadet, Philippe Charton. BMC Bioinformatics, October 2018.
[10.1186/s12859-018-2407-8]
Learned protein embeddings for machine learning.
Kevin K Yang, Zachary Wu, Claire N Bedbrook, Frances H Arnold. Bioinformatics, August 2018.
[10.1093/bioinformatics/bty178]
Deep Semantic Protein Representation for Annotation, Discovery, and Engineering.
Ariel S Schwartz, Gregory J Hannum, Zach R Dwiel, Michael E Smoot, Ana R Grant, Jason M Knight, Scott A Becker, Jonathan R Eads, Matthew C LaFave, Harini Eavani, Yinyin Liu, Arjun K Bansal, Toby H Richardson. Preprint, July 2018.
[10.1101/365965]
Improved Descriptors for the Quantitative Structure–Activity Relationship Modeling of Peptides and Proteins.
Mark H. Barley, Nicholas J. Turner, Royston Goodacre. Journal of Chemical Information and Modeling, January 2018.
[10.1021/acs.jcim.7b00488]
Variational auto-encoding of protein sequences.
Sam Sinai, Eric Kelsic, George M. Church, Martin A. Nowak. Preprint, December 2017.
[arxiv]
Predicting Protein Binding Affinity With Word Embeddings and Recurrent Neural Networks.
Carlo Mazzaferro. Preprint, April 2017.
[10.1101/128223] [bioRxiv]
dna2vec: Consistent vector representations of variable-length k-mers.
Patrick Ng. Preprint, January 2017.
[arxiv]
Distributed Representations for Biological Sequence Analysis.
Dhananjay Kimothi, Akshay Soni, Pravesh Biyani, James M. Hogan. Preprint, August 2016.
[arxiv]
ProFET: Feature engineering captures high-level protein functions.
Dan Ofer, Michal Linial. Bioinformatics, June 2015.
[10.1093/bioinformatics/btv345]
AAindex: amino acid index database, progress report 2008.
Shuichi Kawashima, Piotr Pokarowski, Maria Pokarowska, Andrzej Kolinski, Toshiaki Katayama, Minoru Kanehisa. Nucleic Acids Research, January 2008.
[10.1093/nar/gkm998]
Unsupervised variant prediction
Evotuning protocols for Transformer-based variant effect prediction on multi-domain proteins.
Hideki Yamaguchi, Yutaka Saito. Briefings in Bioinformatics, November 2021.
[10.1093/bib/bbab234]
Disease variant prediction with deep generative models of evolutionary data.
Jonathan Frazer, Pascal Notin, Mafalda Dias, Aidan Gomez, Joseph K Min, Kelly Brock, Yarin Gal, Debora S Marks. Nature, November 2021.
[10.1038/s41586-021-04043-8]
Language models enable zero-shot prediction of the effects of mutations on protein function.
Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, Alexander Rives. Preprint, July 2021.
[10.1101/2021.07.09.450648]
Unsupervised inference of protein fitness landscape from deep mutational scan.
Jorge Fernandez-de-Cossio-Diaz, Guido Uguzzoni, Andrea Pagnani. Preprint, March 2020.
[10.1101/2020.03.18.996595]
Deep generative models of genetic variation capture the effects of mutations.
Adam J. Riesselman, John B. Ingraham, Debora S. Marks. Nature Methods, September 2018.
[10.1038/s41592-018-0138-4]
Variational auto-encoding of protein sequences.
Sam Sinai, Eric Kelsic, George M. Church, Martin A. Nowak. Preprint, December 2017.
[arxiv]
Generative models
Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models.
Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, Jianzhu Ma. Preprint, July 2022.
[10.1101/2022.07.10.499510]
End-to-End deep structure generative model for protein design.
Boqiao Lai, Matthew McPartlon, Jinbo Xu. Preprint, July 2022.
[10.1101/2022.07.09.499440]
Hallucinating protein assemblies.
B. I. M. Wicky, L. F. Milles, A. Courbet, R. J. Ragotte, J. Dauparas, E. Kinfu, S. Tipps, R. D. Kibler, M. Baek, F. DiMaio, X. Li, L. Carter, A. Kang, H. Nguyen, A. K. Bera, D. Baker. Preprint, June 2022.
[10.1101/2022.06.09.493773]
ProGen2: Exploring the Boundaries of Protein Language Models.
Erik Nijkamp, Jeffrey Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali Madani. Preprint, June 2022.
[arxiv]
Sampling the conformational landscapes of transporters and receptors with AlphaFold2.
Diego del Alamo, Davide Sala, Hassane S. Mchaourab, Jens Meiler. Preprint, November 2021.
[10.1101/2021.11.22.469536]
Benchmarking deep generative models for diverse antibody sequence design.
Igor Melnyk, Payel Das, Vijil Chenthamarakshan, Aurelie Lozano. Preprint, November 2021.
[arxiv]
Efficient generative modeling of protein sequences using simple autoregressive models.
Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, Martin Weigt. Nature Communications, October 2021.
[10.1038/s41467-021-25756-4]
Navigating the amino acid sequence space between functional proteins using a deep learning framework.
Tristan Bitard-Feildel. PeerJ Computer Science, September 2021.
[10.7717/peerj-cs.684]
Ancestral Sequence Reconstruction for Co-evolutionary models.
Edwin Rodríguez Horta, Alejandro Lage-Castellanos, Roberto Mulet. Preprint, August 2021.
[arxiv]
AMaLa: Analysis of Directed Evolution Experiments via Annealed Mutational approximated Landscape.
Luca Sesta, Guido Uguzzoni, Jorge Fernandez-de-Cossio Diaz, Andrea Pagnani. International Journal of Molecular Sciences, August 2021.
[10.3390/ijms222010908]
Modeling sequence-space exploration and emergence of epistatic signals in protein evolution.
Matteo Bisardi, Juan Rodriguez-Rivas, Francesco Zamponi, Martin Weigt. Preprint, June 2021.
[arxiv]
Generative AAV capsid diversification by latent interpolation.
Sam Sinai, Nina Jain, George M Church, Eric D Kelsic. Preprint, April 2021.
[10.1101/2021.04.16.440236]
Protein design and variant prediction using autoregressive generative models.
Jung-Eun Shin, Adam Riesselman, Aaron Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew Kruse, Debora Marks. Nature Communications, April 2021.
[10.1038/s41467-021-22732-w]
Expanding functional protein sequence spaces using generative adversarial networks.
Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Elzbieta Rembeza, Jan Zrimec, Simona Poviloniene, Irmantas Rokaitis, Audrius Laurynenas, Wissam Abuajwa, Otto Savolainen, Rolandas Meskys, Martin K. M. Engqvist, Aleksej Zelezniak. Nature Machine Intelligence, March 2021.
[10.1038/s42256-021-00310-5]
Generating functional protein variants with variational autoencoders.
Alex Hawkins-Hooker, Florence Depardieu, Sebastien Baur, Guillaume Couairon, Arthur Chen, David Bikard. PLOS Computational Biology, February 2021.
[10.1371/journal.pcbi.1008736]
Generating novel protein sequences using Gibbs sampling of masked language models.
Sean R. Johnson, Sarah Monaco, Kenneth Massie, Zaid Syed. Preprint, January 2021.
[10.1101/2021.01.26.428322]
The structure-fitness landscape of pairwise relations in generative sequence models.
Dylan Marshall, Haobo Wang, Michael Stiffler, Justas Dauparas, Peter Koo, Sergey Ovchinnikov. Preprint, November 2020.
[10.1101/2020.11.29.402875]