Open-Omics-ESM

Open-Omics-ESM3 is an optimized version of the Evolutionary Scale Modeling ESM3 toolkit, designed for modern CPUs. It enhances the performance of various ESM3 modules by enabling lower-precision computations (bf16) for improved efficiency.

🛠️ Building the Docker Image

Run the following command to build the Docker image:

docker build -t esm3_image .

🌐 Building Behind a Proxy

If you’re working in a corporate or institutional environment, your internet access may be routed through a proxy server. In such cases, Docker may not be able to download dependencies during the build process unless you explicitly configure proxy settings.

To build the Docker image with proxy settings, use the --build-arg option to pass your proxy configuration:

docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy --build-arg no_proxy=$no_proxy -t esm3_image .

🔒 Note: Make sure the environment variables http_proxy, https_proxy, and no_proxy are correctly set in your shell before running this command.

For more details, refer to the official Docker documentation: Docker behind proxy
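The same build command can be assembled programmatically so that only the proxy variables actually set in your shell are forwarded. The following is a minimal sketch; the helper name `docker_build_command` is hypothetical and not part of this repository.

```python
import os
import shlex

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")

def docker_build_command(tag="esm3_image", context=".", env=None):
    """Return the `docker build` argv, forwarding only proxy variables that are set."""
    env = os.environ if env is None else env
    cmd = ["docker", "build"]
    for name in PROXY_VARS:
        value = env.get(name)
        if value:  # skip unset or empty variables
            cmd += ["--build-arg", f"{name}={value}"]
    cmd += ["-t", tag, context]
    return cmd

if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(shlex.join(docker_build_command()))
```

Building the command as a list avoids passing empty `--build-arg` values when a proxy variable is not defined.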

Setting Up Environment Variables and Directories

To ensure the application runs smoothly, set up the necessary directories and environment variables. Follow these steps:

Step 1: Export Environment Variables

Define environment variables for your folder paths. Replace <your input folder>, <your output folder>, and <your Model folder> with the desired paths:

export INPUT=$PWD/<your input folder>
export OUTPUT=$PWD/<your output folder>
export MODELS=$PWD/<your Model folder>

Step 2: Create the Required Directories

Here’s an example using standardized folder names. These commands will create the directories, set the environment variables, and adjust permissions:

mkdir -p input output models
export INPUT=$PWD/input   
export OUTPUT=$PWD/output     
export MODELS=$PWD/models 
chmod a+w $MODELS $OUTPUT
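For setups that are scripted in Python rather than in the shell, the same directory preparation could look roughly like the sketch below. The function name `prepare_dirs` is hypothetical; the folder names match the standardized example above.

```python
import os
import stat
from pathlib import Path

def prepare_dirs(base, names=("input", "output", "models"), writable=("output", "models")):
    """Create the folders under `base` and add write permission for all users where needed."""
    paths = {}
    for name in names:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
        if name in writable:
            # Equivalent of `chmod a+w`: add the write bit for user, group, and others.
            mode = path.stat().st_mode
            path.chmod(mode | stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
        paths[name.upper()] = str(path)
    return paths

# Example: prepare_dirs(os.getcwd()) returns a dict with the keys INPUT, OUTPUT, and MODELS.
```

A Python process cannot export variables into the shell that started it, so the returned paths still have to be exported in the shell or passed directly to whatever launches the container.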

Step 3: 🔑 Hugging Face Token Setup

To access models from the Hugging Face Hub, you’ll need an API token with “Read” permissions. Follow the steps below to create one:

  1. Go to the Hugging Face token management page:
    👉 https://huggingface.co/settings/tokens

  2. Click “New token”.

  3. Set a name for your token (e.g., esm-access).

  4. Under Role, select Read.

  5. Click “Generate token”, then copy the token.

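  6. Set the token as an environment variable in your shell. The docker run examples below pass the token into the container with -e HF_TOKEN, so you can reference this variable there (for example, -e HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"):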

    export HUGGING_FACE_HUB_TOKEN=your_token_here
    

💡 Your token is personal — keep it secure and do not share it in public code or repositories.
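Before launching a long job, it can help to confirm that a token is actually set without printing it in full. The sketch below checks both variable names that appear in this README (`HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN`); the helper names `find_hf_token` and `mask_token` are hypothetical.

```python
import os

TOKEN_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")

def find_hf_token(env=None):
    """Return (variable_name, token) for the first token variable that is set."""
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        token = env.get(name, "").strip()
        if token:
            return name, token
    raise RuntimeError("No Hugging Face token found; set one of: " + ", ".join(TOKEN_VARS))

def mask_token(token, visible=4):
    """Hide all but the last few characters so the token is safe to log."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]
```

Logging only the masked form keeps the token out of terminal history and shared logs.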

Running

Information on flags

--protein_complex flag enables prediction for multi-chain protein complexes using multi-chain FASTA and PDB inputs. When this flag is enabled, FASTA files containing multiple sequences are interpreted as representing different chains within the protein complex.
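To illustrate the input layout this flag expects, the sketch below reads a multi-record FASTA string into one entry per chain. It only shows the file format; how the repository's scripts actually parse their input is not shown here, and the function name `read_fasta_chains` is hypothetical.

```python
def read_fasta_chains(text):
    """Parse FASTA text into a list of (header, sequence) pairs, one per chain."""
    chains = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                chains.append((header, "".join(parts)))
            header, parts = line[1:].strip(), []
        elif header is not None:
            parts.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        chains.append((header, "".join(parts)))
    return chains

example = ">chain_A\nMKT\nAYI\n>chain_B\nGSH\n"
print(read_fasta_chains(example))
```

Each header line starts a new chain, so a FASTA file with two records describes a two-chain complex when the flag is enabled.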

--bf16 flag accelerates inference performance by utilizing bfloat16 precision.
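bfloat16 keeps the 8-bit exponent of float32 but only 7 explicit mantissa bits, so it trades precision for speed and memory. The sketch below emulates that rounding in pure Python for finite values, to give a feel for the size of the rounding error. It is an illustration only, not how the toolkit performs the conversion, and the function name `to_bfloat16` is hypothetical.

```python
import struct

def to_bfloat16(x):
    """Round a finite Python float to the nearest bfloat16 value and return it as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # Round to nearest, ties to even, on the 16 low bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

for value in (1.0, 3.14159265, 0.1):
    print(value, "->", to_bfloat16(value))
```

Values keep roughly two to three significant decimal digits, which is usually acceptable for inference but worth keeping in mind when comparing outputs against a full-precision run.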

ESMC - Logits Embeddings: Extracts protein sequence embeddings using logits for downstream analysis

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESMC_logits_embedding_task.py /input/some_proteins.fasta /output/ --bf16

ESM3 - Logits Embeddings: Generates embeddings with ESM3 for sequence representation and analysis

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESM3_logits_embedding_task.py /input/some_proteins.fasta /output/ --bf16

ESM3 - Folding: Predicts the 3D structure of proteins from amino acid sequences using ESM3

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESM3_folding_task.py /input/sample_sequence.fasta /output/ --bf16

ESM3 - Inverse Folding: Designs protein sequences that fold into a given 3D structure

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESM3_inversefold_task.py /input/5YH2.pdb /output/ --bf16

ESM3 - Function Prediction: Predicts protein function from structural and sequence data

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESM3_function_prediction_task.py /input/1utn.pdb /output/ --bf16

ESM3 - Prompt Sequence: Generates protein sequences based on user-provided prompts for design tasks

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESM3_prompt_sequence.py /input/prompt.fasta /output/ --bf16

ESM3 - Chain of Thought: Uses reasoning-based approaches to analyze and interpret protein data

#example
docker run -it \
    -e HF_TOKEN="<your_huggingface_token>" \
    -v $MODELS:/models \
    -v $INPUT:/input \
    -v $OUTPUT:/output \
    esm3_image:latest \
    python scripts/ESM3_chain_of_thought.py /input/1utn.csv /output/ --bf16
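The docker run examples above all share the same shape and differ only in the script name and input file. A small helper can build the argument list for any of them; this is a sketch, the function name `docker_run_command` is hypothetical, and the `flags` default simply mirrors the examples.

```python
def docker_run_command(script, input_path, hf_token, models_dir, input_dir, output_dir,
                       image="esm3_image:latest", flags=("--bf16",)):
    """Build the `docker run` argv for one task script, following the examples above."""
    return [
        "docker", "run", "-it",
        "-e", f"HF_TOKEN={hf_token}",
        "-v", f"{models_dir}:/models",
        "-v", f"{input_dir}:/input",
        "-v", f"{output_dir}:/output",
        image,
        "python", f"scripts/{script}", input_path, "/output/",
        *flags,
    ]

# Example: the folding task with placeholder host paths.
cmd = docker_run_command(
    "ESM3_folding_task.py", "/input/sample_sequence.fasta",
    hf_token="<your_huggingface_token>",
    models_dir="/path/to/models", input_dir="/path/to/input", output_dir="/path/to/output",
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run` avoids shell quoting issues when paths contain spaces.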

The original README content of ESM3 follows.

This repository contains flagship protein models for EvolutionaryScale, as well as access to the API. ESM3 is our flagship multimodal protein generative model, and can be used for generation and prediction tasks. ESM C is our best protein representation learning model, and can be used to embed protein sequences.

Installation

To get started with ESM, install the python library using pip:

pip install esm

ESM 3

ESM3 is a frontier generative model for biology, able to jointly reason across three fundamental biological properties of proteins: sequence, structure, and function. These three data modalities are represented as tracks of discrete tokens at the input and output of ESM3. You can present the model with a combination of partial inputs across the tracks, and ESM3 will provide output predictions for all the tracks.

ESM3 is a generative masked language model. You can prompt it with partial sequence, structure, and function keywords, and iteratively sample masked positions until all positions are unmasked. This iterative sampling is what the .generate() function does.
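The idea of iterative unmasking can be shown with a toy loop. In the sketch below a uniform random choice stands in for the model's predictions, and the schedule (an even share of the remaining masked positions per step) is an assumption made for illustration; it is not ESM3's actual sampling procedure. The name `toy_generate` is hypothetical.

```python
import math
import random

MASK = "_"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_generate(prompt, num_steps, seed=0):
    """Fill every masked position of `prompt` over at most `num_steps` rounds."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        n_unmask = math.ceil(len(masked) / (num_steps - step))
        for i in rng.sample(masked, n_unmask):
            tokens[i] = rng.choice(AMINO_ACIDS)  # a real model would sample from its logits
    return "".join(tokens)

print(toy_generate("___DQATSL___", num_steps=4))
```

Positions given in the prompt are never changed; only masked positions are filled, a few per step, until none remain.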

ESM3 Diagram

The ESM3 architecture is highly scalable due to its transformer backbone and all-to-all reasoning over discrete token sequences. At its largest scale, ESM3 was trained with 1.07e24 FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters. Learn more by reading the blog post and the pre-print (Hayes et al., 2024).

ESM3-open, with 1.4B parameters, is the smallest and fastest model in the family.

Quickstart for ESM3-open

pip install esm

The weights are stored on HuggingFace Hub under HuggingFace/EvolutionaryScale/esm3.

from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig

# login() will instruct you how to get an API key from the Hugging Face Hub; make one with "Read" permission.
login()

# This will download the model weights and instantiate the model on your machine.
model: ESM3InferenceClient = ESM3.from_pretrained("esm3-open").to("cuda") # or "cpu"

# Generate a completion for a partial Carbonic Anhydrase (2vvb)
prompt = "___________________________________________________DQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGTYRLIQFHFHWGSLDGQGSEHTVDKKKYAAELHLVHWNTKYGDFGKAVQQPDGLAVLGIFLKVGSAKPGLQKVVDVLDSIKTKGKSADFTNFDPRGLLPESLDYWTYPGSLTTPP___________________________________________________________"
protein = ESMProtein(sequence=prompt)
# Generate the sequence, then the structure. This will iteratively unmask the sequence track.
protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8, temperature=0.7))
# We can show the predicted structure for the generated sequence.
protein = model.generate(protein, GenerationConfig(track="structure", num_steps=8))
protein.to_pdb("./generation.pdb")
# Then we can do a round trip design by inverse folding the sequence and recomputing the structure
protein.sequence = None
protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
protein.coordinates = None
protein = model.generate(protein, GenerationConfig(track="structure", num_steps=8))
protein.to_pdb("./round_tripped.pdb")

Congratulations, you just generated your first proteins with ESM3!

EvolutionaryScale Forge: Access to larger ESM3 models

You can access all scales of ESM3 models through EvolutionaryScale Forge.

We encourage users to interact with the Forge API through the python esm library instead of the command line. The python interface enables you to interactively load proteins, build prompts, and inspect generated proteins with the ESMProtein and config classes used to interact with the local model.

In any example script you can replace a local ESM3 model with a Forge API client:

# Instead of loading the model locally on your machine:
model: ESM3InferenceClient = ESM3.from_pretrained("esm3_sm_open_v1").to("cuda") # or "cpu"
# just replace the line with this:
model: ESM3InferenceClient = esm.sdk.client("esm3-medium-2024-08", token="<your forge token>")
# and now you're interfacing with the model running on our remote servers.
...

and the exact same code will work. This enables a seamless transition from smaller and faster models, to our largest and most capable protein language models for protein design work.

ESM3 Example Usage

Check out our tutorials to learn how to use ESM3.

ESM C

ESM Cambrian is a parallel model family to our flagship ESM3 generative models. While ESM3 focuses on controllable generation of proteins, ESM C focuses on creating representations of the underlying biology of proteins.

ESM C is designed as a drop-in replacement for ESM2 and comes with major performance benefits. The 300M parameter ESM C delivers similar performance to ESM2 650M with dramatically reduced memory requirements and faster inference. The 600M parameter ESM C rivals the 3B parameter ESM2 and approaches the capabilities of the 15B model, delivering frontier performance with far greater efficiency. The 6B parameter ESM C outperforms the best ESM2 models by a wide margin.

ESM C can be run locally, via the Forge API or through AWS SageMaker.

Quickstart for ESM C Open Models

When running the code below, a pytorch model will be instantiated locally on your machine, with the weights downloaded from the HuggingFace hub.

from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

protein = ESMProtein(sequence="AAAAA")
client = ESMC.from_pretrained("esmc_300m").to("cuda") # or "cpu"
protein_tensor = client.encode(protein)
logits_output = client.logits(
   protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
)
print(logits_output.logits, logits_output.embeddings)
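A common downstream step is to reduce per-token embeddings to a single fixed-size vector per protein by averaging. The sketch below does this on plain nested lists so that it runs without any dependencies. Treating the first and last tokens as start and end markers is an assumption, and the name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, skip_special=True):
    """Average per-token vectors into one fixed-size vector.

    `token_embeddings` is a list of equal-length lists of floats, one per token.
    With `skip_special=True` the first and last tokens are dropped, on the
    assumption that they are start and end markers rather than residues.
    """
    rows = token_embeddings[1:-1] if skip_special else token_embeddings
    if not rows:
        raise ValueError("no token embeddings to pool")
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(dim)]

# Four tokens with two-dimensional embeddings; the outer two are treated as markers.
print(mean_pool([[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]))
```

With tensor outputs, the equivalent operation is a mean over the sequence dimension.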

To use Flash Attention with the open weights:

Simply install the flash-attn package, which will enable Flash Attention automatically:

pip install flash-attn --no-build-isolation

You can also disable flash-attn by passing use_flash_attn=False to utils like ESMC_300M_202412.

ESM C 6B via Forge API

Apply for access and copy the API token from the console by first visiting Forge.

With the code below, a local python client talks to the model inference server hosted by EvolutionaryScale.

from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, LogitsConfig

# Apply for forge access and get an access token
forge_client = ESM3ForgeInferenceClient(model="esmc-6b-2024-12", url="https://forge.evolutionaryscale.ai", token="<your forge token>")
protein = ESMProtein(sequence="AAAAA")
protein_tensor = forge_client.encode(protein)
logits_output = forge_client.logits(
   protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
)
print(logits_output.logits, logits_output.embeddings)

Remember to replace <your forge token> with your actual Forge access token.

Forge Batch Executor

For jobs that require processing multiple inputs, the Forge Batch Executor provides a streamlined way to execute them concurrently and efficiently while respecting rate limits and adapting to request latency.

from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, ESMProteinError, LogitsConfig, LogitsOutput
from esm.sdk import batch_executor

def embed_sequence(client: ESM3ForgeInferenceClient, sequence: str) -> LogitsOutput:
    protein = ESMProtein(sequence=sequence)
    protein_tensor = client.encode(protein)
    if isinstance(protein_tensor, ESMProteinError):
        raise protein_tensor
    output = client.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    return output

sequences = ["A", "AA", "AAA"]
client = ESM3ForgeInferenceClient(model="esmc-6b-2024-12", url="https://forge.evolutionaryscale.ai", token="<your forge token>")

# Usage Example:
# To execute a batch job, wrap your function inside the batch executor context manager.
# Syntax:
# with batch_executor() as executor:
#     outputs = executor.execute_batch(user_func=<your_function>, **kwargs)

with batch_executor() as executor:
    outputs = executor.execute_batch(user_func=embed_sequence, client=client, sequence=sequences)
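Conceptually, a batch executor fans a function out over its list-valued arguments and collects the results. The standard-library sketch below shows only that fan-out, under the assumption that list arguments are iterated together and other arguments are reused for every call. It omits the rate limiting and latency adaptation described above and is not the Forge implementation; `run_batch` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(user_func, max_workers=4, **kwargs):
    """Call `user_func` once per element of the list-valued keyword arguments.

    Non-list arguments are reused for every call. Results keep the input order.
    """
    list_lengths = {len(v) for v in kwargs.values() if isinstance(v, list)}
    if len(list_lengths) != 1:
        raise ValueError("expected list arguments of one common length")
    n = list_lengths.pop()
    calls = [
        {k: (v[i] if isinstance(v, list) else v) for k, v in kwargs.items()}
        for i in range(n)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(user_func, **call) for call in calls]
        return [f.result() for f in futures]

def describe(prefix, sequence):
    return f"{prefix}:{len(sequence)}"

print(run_batch(describe, prefix="len", sequence=["A", "AA", "AAA"]))
```

Because arguments are forwarded by keyword, their names must match the parameter names of the function being called.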

ESM C via SageMaker for Commercial Use

ESM C models are also available on Amazon SageMaker under the Cambrian Inference Clickthrough License Agreement. Under this license agreement, models are available for broad use for commercial entities.

You will need admin access to an AWS account to follow these instructions. The first step is to deploy the AWS package:

  1. Find the ESM C model version you want to subscribe to. All of our offerings are visible here.
  2. Click the name of the model version you are interested in, review pricing information and the end user license agreement (EULA), then click “Continue to Subscribe”.
  3. Once you have subscribed, you should be able to see our model under your marketplace subscriptions.
  4. Click the product name and then from the “Actions” dropdown select “Configure”.
  5. You will next see the “Configure and Launch” UI. There are multiple deployment paths - we recommend using “AWS CloudFormation”.
  6. The default value for “Service Access” may or may not work. We recommend clicking “Create and use a new service role”.
  7. Click “Launch CloudFormation Template”. This takes 15 to 25 minutes depending on model size.
  8. On the “Quick create stack” page, ensure the stack name and endpoint names are not already used. You can check existing stack names here and existing endpoint names here.

The SageMaker deployment of the model now lives on a dedicated GPU instance inside your AWS environment, and will be billed directly to your AWS account. Remember to shut down the instance when you are done using it. Find the CloudFormation stack you created here, select it, and then click “Delete” to clean up all resources.

After creating the endpoint, you can create a SageMaker client and use it the same way as a Forge client. They share the same API. The local python client talks to the SageMaker endpoint you just deployed, which runs on an instance with a GPU to run model inference.

Ensure that the code below runs in an environment that has AWS credentials available for the account which provisioned SageMaker resources. Learn more about general AWS credential options here.

from esm.sdk.sagemaker import ESM3SageMakerClient
from esm.sdk.api import ESMProtein, LogitsConfig

sagemaker_client = ESM3SageMakerClient(
   # E.g. "Endpoint-ESMC-6B-1"
   endpoint_name=SAGE_ENDPOINT_NAME,
   # E.g. "esmc-6b-2024-12". Same model names as in Forge.
   model=MODEL_NAME,
)

protein = ESMProtein(sequence="AAAAA")
protein_tensor = sagemaker_client.encode(protein)
logits_output = sagemaker_client.logits(
   protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
)
print(logits_output.logits, logits_output.embeddings)

ESM C Example Usage

Check out our tutorials to learn how to use ESM C.

Responsible Development

EvolutionaryScale is a public benefit company. Our mission is to develop artificial intelligence to understand biology for the benefit of human health and society, through partnership with the scientific community, and open, safe, and responsible research. Inspired by the history of our field as well as new principles and recommendations, we have created a Responsible Development Framework to guide our work towards our mission with transparency and clarity.

The core tenets of our framework are

  • We will communicate the benefits and risks of our research
  • We will proactively and rigorously evaluate the risk of our models before public deployment
  • We will adopt risk mitigation strategies and precautionary guardrails
  • We will work with stakeholders in government, policy, and civil society to keep them informed

With this in mind, we have performed a variety of mitigations for esm3-sm-open-v1, detailed in our paper.

Licenses

The code and model weights of ESM3 and ESM C are available under a mixture of non-commercial and permissive commercial licenses. For complete license details, see LICENSE.md.

Citations

If you use ESM in your work, please cite one of the following:

ESM3

@article {hayes2024simulating,
	author = {Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J. and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q. and Deaton, Jonathan and Wiggert, Marius and Badkundri, Rohil and Shafkat, Irhum and Gong, Jun and Derry, Alexander and Molina, Raul S. and Thomas, Neil and Khan, Yousuf A. and Mishra, Chetan and Kim, Carolyn and Bartie, Liam J. and Nemeth, Matthew and Hsu, Patrick D. and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
	title = {Simulating 500 million years of evolution with a language model},
	year = {2024},
	doi = {10.1101/2024.07.01.600583},
	URL = {https://doi.org/10.1101/2024.07.01.600583},
	journal = {bioRxiv}
}

ESM C

@misc{esm2024cambrian,
  author = {{EvolutionaryScale Team}},
  title = {ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning},
  year = {2024},
  publisher = {EvolutionaryScale Website},
  url = {https://evolutionaryscale.ai/blog/esm-cambrian},
  urldate = {2024-12-04}
}

ESM Github (Code / Weights)

@software{evolutionaryscale_2024,
  author = {{EvolutionaryScale Team}},
  title = {evolutionaryscale/esm},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.14219303},
  URL = {https://doi.org/10.5281/zenodo.14219303}
}