OpenOmics ProteinMPNN
ProteinMPNN is a widely used deep learning-based method for protein sequence design. Given a protein backbone structure, it generates amino acid sequences, enabling the design of de novo proteins and the optimization of existing ones.
Here, we present OpenOmics ProteinMPNN, a highly optimized version for modern CPUs with the exact same functionality and accuracy as the original ProteinMPNN. OpenOmics ProteinMPNN also supports lower-precision (bfloat16) computations.
Using Docker
Build
git clone https://github.com/IntelLabs/Open-Omics-Acceleration-Framework.git
cd Open-Omics-Acceleration-Framework/applications/ProteinMPNN
docker build --build-arg http_proxy=<proxy_url> --build-arg https_proxy=<proxy_url> -t pmpnn .
Run
The main script for ProteinMPNN, protein_mpnn_run.py, can be run as:
docker run -it -v <output_dir>:/outputs -v <input_dir>:/input pmpnn:latest python protein_mpnn_run.py
Note: The input parameters to protein_mpnn_run.py are described in the original README below.
Examples:
Various ProteinMPNN example scripts are present in examples/ and can be run as follows:
Simple monomer example
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_1.py
Simple multi-chain example
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_2.py
Directly from the .pdb path
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_3.py
Return score only (model’s uncertainty)
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_3_score_only.py
Return score only (model’s uncertainty) loading sequence from fasta files
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_3_score_only_from_fasta.py
Fix some residue positions
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_4.py
Specify which positions to design
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_4_non_fixed.py
Tie some positions together (symmetry)
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_5.py
Homooligomer example
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_6.py
Return sequence unconditional probabilities (PSSM like)
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_7.py
Add amino acid bias
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_8.py
Use PSSM bias when designing sequences
docker run -it -v <output_dir>:/outputs pmpnn:latest python examples/script_example_pssm.py
All the above scripts are parameterizable, for example:
mkdir -p ./output
docker run -it -v ./output:/outputs pmpnn:latest python examples/script_example_1.py --input /ProteinMPNN/inputs/PDB_monomers/pdbs --num_seq_per_target 10 --sampling_temp 0.1 --seed 37 --batch_size 1 --precision bfloat16
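When sweeping over seeds or precisions, it can help to assemble the invocation above programmatically. The following is a minimal sketch, not part of the repository: `build_docker_cmd` is a hypothetical helper, and it only uses the image tag, mount point, and flags shown in this README.

```python
import os
import shlex


def build_docker_cmd(output_dir, script, script_args=None, image="pmpnn:latest"):
    """Return the argv list for running an example script in the container.

    Hypothetical helper: it mirrors the `docker run` lines in this README.
    """
    # Use an absolute host path for the bind mount.
    host_out = os.path.abspath(output_dir)
    cmd = ["docker", "run", "-it", "-v", f"{host_out}:/outputs", image,
           "python", script]
    # Flags are appended as "--name value" pairs, in insertion order.
    for flag, value in (script_args or {}).items():
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_docker_cmd(
    "./output",
    "examples/script_example_1.py",
    {"num_seq_per_target": 10, "sampling_temp": 0.1, "seed": 37,
     "batch_size": 1, "precision": "bfloat16"},
)
# Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.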
Using source code
Install
source setup_proteinmpnn.sh
Install jemalloc for better performance
git clone --branch 5.3.0 https://github.com/jemalloc/jemalloc.git
cd jemalloc && bash autogen.sh --prefix=<install_location> && make install
cd ..
export LD_LIBRARY_PATH=<install_location>/lib:$LD_LIBRARY_PATH
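After the steps above, a quick check can confirm that the library was installed where expected and that its directory is on `LD_LIBRARY_PATH`. This is a hedged sketch: `find_jemalloc` and `on_library_path` are hypothetical helper names, and `/usr/local` stands in for `<install_location>`.

```python
import glob
import os


def find_jemalloc(install_location):
    """Return sorted paths of libjemalloc shared objects under <install_location>/lib."""
    pattern = os.path.join(install_location, "lib", "libjemalloc.so*")
    return sorted(glob.glob(pattern))


def on_library_path(install_location, env=None):
    """Check whether <install_location>/lib is listed in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    lib_dir = os.path.normpath(os.path.join(install_location, "lib"))
    entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return any(os.path.normpath(entry) == lib_dir for entry in entries if entry)


# "/usr/local" is a placeholder for the prefix passed to autogen.sh.
print(find_jemalloc("/usr/local"))
print(on_library_path("/usr/local"))
```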
Run
Simple monomer example:
cd examples/
python script_example_1.py --output <output-dir> --input <input-pdb-directory> --precision <bfloat16/float32>
## example
mkdir -p ./output
cd examples
python script_example_1.py --output ../output/ --input ../inputs/PDB_monomers/pdbs/ --precision float32
OpenOmics ProteinMPNN README ends here
Original ProteinMPNN README follows:
ProteinMPNN
Read the ProteinMPNN paper.
To run ProteinMPNN, clone this GitHub repo and install Python>=3.0, PyTorch, and NumPy.
Full protein backbone models: vanilla_model_weights/v_48_002.pt, v_48_010.pt, v_48_020.pt, v_48_030.pt, soluble_model_weights/v_48_010.pt, v_48_020.pt.
CA only models: ca_model_weights/v_48_002.pt, v_48_010.pt, v_48_020.pt. Enable flag --ca_only to use these models.
Helper scripts: helper_scripts - helper functions to parse PDBs, assign which chains to design, which residues to fix, adding AA bias, tying residues etc.
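The model names follow the convention given in the `--model_name` help text ("v_48_010=version with 48 edges 0.10A noise"). The sketch below decodes a name and builds the corresponding weights path. Both functions are hypothetical helpers, and the precedence of `ca_only` over `use_soluble_model` is an assumption, not something this README states.

```python
import os
import re


def describe_model(model_name):
    """Split a name such as 'v_48_020' into (edges, noise in angstroms)."""
    match = re.fullmatch(r"v_(\d+)_(\d{3})", model_name)
    if match is None:
        raise ValueError(f"unexpected model name: {model_name!r}")
    edges = int(match.group(1))
    noise = int(match.group(2)) / 100.0  # '020' -> 0.20
    return edges, noise


def weights_path(model_name, ca_only=False, use_soluble_model=False, root="."):
    """Pick the weights folder from the flags (assumed precedence: CA-only first)."""
    if ca_only:
        folder = "ca_model_weights"
    elif use_soluble_model:
        folder = "soluble_model_weights"
    else:
        folder = "vanilla_model_weights"
    return os.path.join(root, folder, f"{model_name}.pt")


print(describe_model("v_48_020"))
print(weights_path("v_48_002", ca_only=True))
```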
Code organization:
* `protein_mpnn_run.py` - the main script to initialize and run the model.
* `protein_mpnn_utils.py` - utility functions for the main script.
* `examples/` - simple code examples.
* `inputs/` - input PDB files for examples
* `outputs/` - outputs from examples
* `colab_notebooks/` - Google Colab examples
* `training/` - code and data to retrain the model

Input flags for `protein_mpnn_run.py`:
```
argparser.add_argument("--suppress_print", type=int, default=0, help="0 for False, 1 for True")
argparser.add_argument("--ca_only", action="store_true", default=False, help="Parse CA-only structures and use CA-only models (default: false)")
argparser.add_argument("--path_to_model_weights", type=str, default="", help="Path to model weights folder;")
argparser.add_argument("--model_name", type=str, default="v_48_020", help="ProteinMPNN model name: v_48_002, v_48_010, v_48_020, v_48_030; v_48_010=version with 48 edges 0.10A noise")
argparser.add_argument("--use_soluble_model", action="store_true", default=False, help="Flag to load ProteinMPNN weights trained on soluble proteins only.")
argparser.add_argument("--seed", type=int, default=0, help="If set to 0 then a random seed will be picked;")
argparser.add_argument("--save_score", type=int, default=0, help="0 for False, 1 for True; save score=-log_prob to npy files")
argparser.add_argument("--path_to_fasta", type=str, default="", help="score provided input sequence in a fasta format; e.g. GGGGGG/PPPPS/WWW for chains A, B, C sorted alphabetically and separated by /")
argparser.add_argument("--save_probs", type=int, default=0, help="0 for False, 1 for True; save MPNN predicted probabilites per position")
argparser.add_argument("--score_only", type=int, default=0, help="0 for False, 1 for True; score input backbone-sequence pairs")
argparser.add_argument("--conditional_probs_only", type=int, default=0, help="0 for False, 1 for True; output conditional probabilities p(s_i given the rest of the sequence and backbone)")
argparser.add_argument("--conditional_probs_only_backbone", type=int, default=0, help="0 for False, 1 for True; if true output conditional probabilities p(s_i given backbone)")
argparser.add_argument("--unconditional_probs_only", type=int, default=0, help="0 for False, 1 for True; output unconditional probabilities p(s_i given backbone) in one forward pass")
argparser.add_argument("--backbone_noise", type=float, default=0.00, help="Standard deviation of Gaussian noise to add to backbone atoms")
argparser.add_argument("--num_seq_per_target", type=int, default=1, help="Number of sequences to generate per target")
argparser.add_argument("--batch_size", type=int, default=1, help="Batch size; can set higher for titan, quadro GPUs, reduce this if running out of GPU memory")
argparser.add_argument("--max_length", type=int, default=200000, help="Max sequence length")
argparser.add_argument("--sampling_temp", type=str, default="0.1", help="A string of temperatures, 0.2 0.25 0.5. Sampling temperature for amino acids. Suggested values 0.1, 0.15, 0.2, 0.25, 0.3. Higher values will lead to more diversity.")
argparser.add_argument("--out_folder", type=str, help="Path to a folder to output sequences, e.g. /home/out/")
argparser.add_argument("--pdb_path", type=str, default='', help="Path to a single PDB to be designed")
argparser.add_argument("--pdb_path_chains", type=str, default='', help="Define which chains need to be designed for a single PDB ")
argparser.add_argument("--jsonl_path", type=str, help="Path to a folder with parsed pdb into jsonl")
argparser.add_argument("--chain_id_jsonl",type=str, default='', help="Path to a dictionary specifying which chains need to be designed and which ones are fixed, if not specied all chains will be designed.")
argparser.add_argument("--fixed_positions_jsonl", type=str, default='', help="Path to a dictionary with fixed positions")
argparser.add_argument("--omit_AAs", type=list, default='X', help="Specify which amino acids should be omitted in the generated sequence, e.g. 'AC' would omit alanine and cystine.")
argparser.add_argument("--bias_AA_jsonl", type=str, default='', help="Path to a dictionary which specifies AA composion bias if neededi, e.g. {A: -1.1, F: 0.7} would make A less likely and F more likely.")
argparser.add_argument("--bias_by_res_jsonl", default='', help="Path to dictionary with per position bias.")
argparser.add_argument("--omit_AA_jsonl", type=str, default='', help="Path to a dictionary which specifies which amino acids need to be omited from design at specific chain indices")
argparser.add_argument("--pssm_jsonl", type=str, default='', help="Path to a dictionary with pssm")
argparser.add_argument("--pssm_multi", type=float, default=0.0, help="A value between [0.0, 1.0], 0.0 means do not use pssm, 1.0 ignore MPNN predictions")
argparser.add_argument("--pssm_threshold", type=float, default=0.0, help="A value between -inf + inf to restric per position AAs")
argparser.add_argument("--pssm_log_odds_flag", type=int, default=0, help="0 for False, 1 for True")
argparser.add_argument("--pssm_bias_flag", type=int, default=0, help="0 for False, 1 for True")
argparser.add_argument("--tied_positions_jsonl", type=str, default='', help="Path to a dictionary with tied positions")
```
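Two of these flags take values whose parsed form is easy to misread: `--sampling_temp` is a single string that may hold several temperatures, and `--omit_AAs` uses `type=list`, so argparse splits its value into individual characters. The standalone sketch below reproduces only four of the flags to show that behavior. Splitting the temperature string on whitespace is an assumption about how the script consumes it.

```python
import argparse

# A reduced parser with four of the flags listed above.
argparser = argparse.ArgumentParser()
argparser.add_argument("--sampling_temp", type=str, default="0.1")
argparser.add_argument("--omit_AAs", type=list, default='X')
argparser.add_argument("--seed", type=int, default=0)
argparser.add_argument("--ca_only", action="store_true", default=False)

args = argparser.parse_args(
    ["--sampling_temp", "0.2 0.25 0.5", "--omit_AAs", "AC", "--ca_only"]
)

# Assumed handling: split the temperature string on whitespace.
temperatures = [float(t) for t in args.sampling_temp.split()]

print(temperatures)   # [0.2, 0.25, 0.5]
print(args.omit_AAs)  # ['A', 'C']
print(args.ca_only)   # True

# String defaults also pass through `type`, so the default 'X' becomes ['X'].
defaults = argparser.parse_args([])
print(defaults.omit_AAs)  # ['X']
```

On a shell command line, the multi-temperature value has to be quoted (for example `--sampling_temp "0.2 0.25 0.5"`) so that it reaches the parser as one argument.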
-----------------------------------------------------------------------------------------------------
For example, to create a conda environment to run ProteinMPNN:
* `conda create --name mlfold` - this creates a conda environment called `mlfold`
* `source activate mlfold` - this activates the environment
* `conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch` - install PyTorch following the steps from https://pytorch.org/
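Once the environment is active, a short check can confirm that the required packages are importable before running any example. This is a minimal sketch using only the standard library. `check_environment` is a hypothetical helper name, and the package list simply mirrors the requirements stated in this README.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "numpy")):
    """Report the Python version and whether each package can be imported."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when the package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_environment())
```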
-----------------------------------------------------------------------------------------------------
These examples are provided in `examples/`:
* `submit_example_1.sh` - simple monomer example
* `submit_example_2.sh` - simple multi-chain example
* `submit_example_3.sh` - directly from the .pdb path
* `submit_example_3_score_only.sh` - return score only (model's uncertainty)
* `submit_example_3_score_only_from_fasta.sh` - return score only (model's uncertainty) loading sequence from fasta files
* `submit_example_4.sh` - fix some residue positions
* `submit_example_4_non_fixed.sh` - specify which positions to design
* `submit_example_5.sh` - tie some positions together (symmetry)
* `submit_example_6.sh` - homooligomer example
* `submit_example_7.sh` - return sequence unconditional probabilities (PSSM like)
* `submit_example_8.sh` - add amino acid bias
* `submit_example_pssm.sh` - use PSSM bias when designing sequences
-----------------------------------------------------------------------------------------------------
Output example:
>3HTN, score=1.1705, global_score=1.2045, fixed_chains=['B'], designed_chains=['A', 'C'], model_name=v_48_020, git_hash=015ff820b9b5741ead6ba6795258f35a9c15e94b, seed=37
NMYSYKKIGNKYIVSINNHTEIVKALNAFCKEKGILSGSINGIGAIGELTLRFFNPKTKAYDDKTFREQMEISNLTGNISSMNEQVYLHLHITVGRSDYSALAGHLLSAIQNGAGEFVVEDYSERISRTYNPDLGLNIYDFER/NMYSYKKIGNKYIVSINNHTEIVKALNAFCKEKGILSGSINGIGAIGELTLRFFNPKTKAYDDKTFREQMEISNLTGNISSMNEQVYLHLHITVGRSDYSALAGHLLSAIQNGAGEFVVEDYSERISRTYNPDLGLNIYDFER
>T=0.1, sample=1, score=0.7291, global_score=0.9330, seq_recovery=0.5736
NMYSYKKIGNKYIVSINNHTEIVKALKKFCEEKNIKSGSVNGIGSIGSVTLKFYNLETKEEELKTFNANFEISNLTGFISMHDNKVFLDLHITIGDENFSALAGHLVSAVVNGTCELIVEDFNELVSTKYNEELGLWLLDFEK/NMYSYKKIGNKYIVSINNHTDIVTAIKKFCEDKKIKSGTINGIGQVKEVTLEFRNFETGEKEEKTFKKQFTISNLTGFISTKDGKVFLDLHITFGDENFSALAGHLISAIVDGKCELIIEDYNEEINVKYNEELGLYLLDFNK
>T=0.1, sample=2, score=0.7414, global_score=0.9355, seq_recovery=0.6075
NMYKYKKIGNKYIVSINNHTEIVKAIKEFCKEKNIKSGTINGIGQVGKVTLRFYNPETKEYTEKTFNDNFEISNLTGFISTYKNEVFLHLHITFGKSDFSALAGHLLSAIVNGICELIVEDFKENLSMKYDEKTGLYLLDFEK/NMYKYKKIGNKYVVSINNHTEIVEALKAFCEDKKIKSGTVNGIGQVSKVTLKFFNIETKESKEKTFNKNFEISNLTGFISEINGEVFLHLHITIGDENFSALAGHLLSAVVNGEAILIVEDYKEKVNRKYNEELGLNLLDFNL
* `score` - negative log probability of the sampled amino acids, averaged over the residues that were designed
* `global_score` - negative log probability of the sampled/fixed amino acids, averaged over all residues in all chains
* `fixed_chains` - chains that were not designed (fixed)
* `designed_chains` - chains that were redesigned
* `model_name/CA_model_name` - model name that was used to generate results, e.g. `v_48_020`
* `git_hash` - Git commit of the code that was used to generate outputs
* `seed` - random seed
* `T=0.1` - temperature equal to 0.1 was used to sample sequences
* `sample` - sequence sample number 1, 2, 3, etc.
-----------------------------------------------------------------------------------------------------
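The difference between `score` and `global_score` is only the set of residues averaged over. The sketch below illustrates this with made-up per-residue log-probabilities. The values, the mask, and the helper name `mean_neg_log_prob` are illustrative assumptions, not the repository's implementation.

```python
def mean_neg_log_prob(log_probs, mask):
    """Average of -log p over positions where mask is True."""
    selected = [-lp for lp, keep in zip(log_probs, mask) if keep]
    if not selected:
        raise ValueError("mask selects no residues")
    return sum(selected) / len(selected)


# Toy values: log-probabilities of the sampled/fixed amino acid at 4 positions.
log_probs = [-0.5, -1.0, -2.0, -0.5]
designed = [True, True, False, False]  # which residues were redesigned

score = mean_neg_log_prob(log_probs, designed)           # designed residues only
global_score = mean_neg_log_prob(log_probs, [True] * 4)  # all residues

print(score, global_score)  # 0.75 1.0
```

Lower values mean the model assigns higher probability to the sequence.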
```
@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}
```
-----------------------------------------------------------------------------------------------------