Benchmarking of Open Omics Acceleration Framework on AWS
Step by step commands to benchmark the OpenOmics framework on AWS
- Log in to your AWS account
- Launch a virtual machine with EC2
- Choose an Amazon Machine Image (AMI): Select any 64-bit (x86) AMI (say, Ubuntu Server 20.04 LTS) from “Quick Start”.
- Choose an Instance Type.
- Configure Instance.
- Add Storage: You can add storage based on the workload requirements
- Configure security group
- Review and launch the instance (ensure you have/create a key to ssh login in next step)
- Use SSH to login to the machine after the instance is up and running
- $ ssh -i /path/to/your-key-pair.pem username@Public-DNS (substitute your key file and the instance's public DNS name)
- The AWS instance is now ready to use: you can download the OpenOmics workloads and related datasets and execute them on this instance.
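For reference, the same launch can be scripted with the AWS CLI instead of the console. A minimal sketch is shown below, assuming the AWS CLI is installed and configured; the AMI ID, key name, security group, and volume size are placeholders, not values from this guide.
# Hypothetical values; substitute your region's Ubuntu AMI, your key pair, and your security group.
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type m6i.16xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxx \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500}'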
Machine configurations used for benchmarking
AWS c5.12xlarge: 1-instance AWS c5.12xlarge: 48 vCPUs (Cascade Lake), 96 GB total memory, ucode: 0x500320a, Ubuntu 22.04, 5.15.0-1004-aws
AWS m5.12xlarge: 1-instance AWS m5.12xlarge: 48 vCPUs (Cascade Lake), 192 GB total memory, ucode: 0x500320a, Ubuntu 22.04, 5.15.0-1004-aws
AWS c6i.16xlarge: 1-instance AWS c6i.16xlarge: 64 vCPUs (Ice Lake), 128 GB total memory, ucode: 0xd000331, Ubuntu 22.04, 5.15.0-1004-aws
AWS m6i.16xlarge: 1-instance AWS m6i.16xlarge: 64 vCPUs (Ice Lake), 256 GB total memory, ucode: 0xd000331, Ubuntu 22.04, 5.15.0-1004-aws
Step by step instructions to benchmark baseline (bwa-mem) and OpenOmics BWA-MEM (bwa-mem2) on m5.12xlarge and m6i.16xlarge instances of AWS
Step 1: Download datasets
Download reference genome: Homo_sapiens_assembly38.fasta (the commands below refer to it as hs_asm38/Homo_sapiens_assembly38.fasta; adjust the path to wherever you store it)
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
Download dataset #1: ERR194147 (Paired-End) from s3://sra-pub-run-odp/sra/ERR194147/ERR194147
Download dataset #2: ERR1955529 (Paired-End) from s3://sra-pub-run-odp/sra/ERR1955529/ERR1955529
Download dataset #3: ERR3239276 (Paired-End) from s3://sra-pub-run-odp/sra/ERR3239276/ERR3239276
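The runs above are SRA objects, so they need to be converted to FASTQ after download. A minimal sketch of one way to do this is shown below, assuming the AWS CLI and the SRA Toolkit (fasterq-dump) are installed.
# Download the SRA object from the public ODP bucket (no AWS credentials needed),
# then split it into ERR194147_1.fastq / ERR194147_2.fastq; repeat for the other runs.
aws s3 cp s3://sra-pub-run-odp/sra/ERR194147/ERR194147 . --no-sign-request
fasterq-dump --split-files ./ERR194147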
Step 2: Download and compile baseline (BWA v0.7.17) and create index
curl -L https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2 | tar jxf -
cd bwa-0.7.17 && make
./bwa index hs_asm38/Homo_sapiens_assembly38.fasta
cd ..
Step 3: Download OpenOmics BWA-MEM (BWA-MEM2 v2.2.1) and create index
curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.2.1/bwa-mem2-2.2.1_x64-linux.tar.bz2 | tar jxf -
cd bwa-mem2-2.2.1_x64-linux && ./bwa-mem2 index hs_asm38/Homo_sapiens_assembly38.fasta
cd ..
Step 4: Run baseline and OpenOmics BWA-MEM
The m5.12xlarge instance has 24 physical cores (48 vCPUs), while the m6i.16xlarge instance has 32 physical cores (64 vCPUs). The memory available on these instances allows running both hardware threads on each core, so the commands below use 48 and 64 threads, respectively.
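To confirm the core/thread layout on your instance before choosing -t, lscpu and nproc can be used; a quick check is sketched below.
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'   # physical layout
nproc   # total hardware threads; matches the -t values used below (48 or 64)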
Run baseline BWA-MEM
cd bwa-0.7.17
Sample commands shown below:
For m5.12xlarge:
./bwa mem -t 48 hs_asm38/Homo_sapiens_assembly38.fasta ERR194147_1.fastq ERR194147_2.fastq > ERR194147.out.sam
For m6i.16xlarge:
./bwa mem -t 64 hs_asm38/Homo_sapiens_assembly38.fasta ERR194147_1.fastq ERR194147_2.fastq > ERR194147.out.sam
Run OpenOmics BWA-MEM
cd ../bwa-mem2-2.2.1_x64-linux
Sample commands shown below:
For m5.12xlarge:
./bwa-mem2 mem -t 48 hs_asm38/Homo_sapiens_assembly38.fasta ERR194147_1.fastq ERR194147_2.fastq > ERR194147.out.sam
For m6i.16xlarge:
./bwa-mem2 mem -t 64 hs_asm38/Homo_sapiens_assembly38.fasta ERR194147_1.fastq ERR194147_2.fastq > ERR194147.out.sam
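To compare the baseline and OpenOmics runs, each command can be wrapped with /usr/bin/time. A sketch for the OpenOmics run on m6i.16xlarge is below; redirecting stderr to a log file is an assumption about how you want to collect results (on Ubuntu, /usr/bin/time is provided by the "time" package).
# Records elapsed wall-clock time and peak resident memory alongside the alignment.
/usr/bin/time -v ./bwa-mem2 mem -t 64 hs_asm38/Homo_sapiens_assembly38.fasta \
    ERR194147_1.fastq ERR194147_2.fastq > ERR194147.out.sam 2> bwa-mem2.time.log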
Step by step instructions to benchmark baseline (minimap2) and OpenOmics minimap2 (mm2-fast) on c5.12xlarge and c6i.16xlarge instances of AWS
Step 1: Download datasets
Download reference genome
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
Link to download HG002 ONT Guppy 3.6.0 dataset:
https://precision.fda.gov/challenges/10/view
File name: HG002_GM24385_1_2_3_Guppy_3.6.0_prom.fastq.gz
Link to download HG002 HiFi 14kb-15kb dataset:
https://precision.fda.gov/challenges/10/view
File name: HG002_35x_PacBio_14kb-15kb.fastq.gz
Download HG002 CLR dataset from s3://giab/data_indexes/AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_subreads_fasta_10082018 (an index file listing the subread FASTA files; see the sketch after this list)
Download hap2 assembly dataset:
wget https://zenodo.org/record/4393631/files/NA24385.HiFi.hifiasm-0.12.hap2.fa.gz
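As noted above, the CLR entry is an index file rather than the reads themselves. A minimal sketch for fetching it with the AWS CLI is below; the FASTA files it lists must then be downloaded separately.
# Public GIAB bucket; no AWS credentials required.
aws s3 cp s3://giab/data_indexes/AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_subreads_fasta_10082018 . --no-sign-request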
Step 2: Download and compile baseline (minimap2 v2.22)
git clone https://github.com/lh3/minimap2.git -b v2.22
cd minimap2 && make
Step 3: Run baseline minimap2
./minimap2 -ax [preset] [ref-seq] [read-seq] -t [num_threads] > minimap2output
Example command for ONT HG002 dataset:
./minimap2 -ax map-ont GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz HG002_ONT.fastq -t 48 > minimap2output
Step 4: Download and compile OpenOmics minimap2 (mm2-fast)
git clone --recursive https://github.com/bwa-mem2/mm2-fast.git -b mm2-fast-v2.22 mm2-fast-contrib
cd mm2-fast-contrib && make multi
Step 5: Create index for OpenOmics minimap2
./build_rmi.sh <path-to-ref-seq> <preset flags>
<preset flags> are as follows:
ONT: map-ont
HiFi: map-hifi
CLR: map-pb
Assembly: asm5
Example: Create OpenOmics minimap2 index for ONT datasets for GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
./build_rmi.sh GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz map-ont
Step 6: Run OpenOmics minimap2
./mm2-fast -ax [preset] [ref-seq] [read-seq] -t [num_threads] > mm2-fastoutput
Example command to run HG002 ONT dataset on c5.12xlarge
./mm2-fast -ax map-ont GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz HG002_ONT.fastq -t 48 > mm2-fastoutput
Example command to run HG002 ONT dataset on c6i.16xlarge
./mm2-fast -ax map-ont GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz HG002_ONT.fastq -t 64 > mm2-fastoutput
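Since mm2-fast is an accelerated drop-in for minimap2, the two runs are expected to produce matching alignments. A quick equivalence check, ignoring the @PG header line (which records the command used), is sketched below.
# Empty output means the baseline and OpenOmics alignments are identical.
diff <(grep -v '^@PG' minimap2output) <(grep -v '^@PG' mm2-fastoutput)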
Step by step instructions to benchmark OpenOmics for ATAC-Seq data analysis on multiple c5.24xlarge and c6i.32xlarge instances of AWS
Update
apt-get update
Install software
apt-get install -y git libcurl4-openssl-dev hdf5-tools rsync make gcc libblas-dev python3.7 python3-pip
ln -nsf /usr/bin/python3.7 /usr/bin/python
Anaconda Environment
conda create --name Atac python=3.7
conda activate Atac
Clone the libxsmm repository and set library path
cd /home/
git clone https://github.com/libxsmm/libxsmm.git
cd /home/libxsmm
git checkout b3da2b1bed9d27f9d6bae91a683f8cf76fe299b5
make -j # Use AVX=2 for AVX2 and AVX=3 for AVX512
cd /home/
export LD_LIBRARY_PATH=/home/libxsmm/lib/
Clone atacworks repo
git clone --branch v0.2.0 https://github.com/clara-parabricks/AtacWorks.git
Clone the OpenOmics version
git clone https://github.com/IntelLabs/Trans-Omics-Acceleration-Library.git
Apply patch
cd /home/AtacWorks/
git apply /home/Trans-Omics-Acceleration-Library/applications/ATAC-Seq/AtacWorks_cpu_optimization_patch.patch
Install python packages
python3.7 -m pip install -r requirements-base.txt
python3.7 -m pip install torch torchvision torchaudio
python3.7 -m pip install -r requirements-macs2.txt
(Optional) Install torch-ccl
# Install torch-ccl
# git clone --branch v1.1.0 https://github.com/intel/torch-ccl.git && cd torch-ccl
# git submodule sync
# git submodule update --init --recursive
# python3.7 setup.py install
Install 1D convolution module
cd /home/libxsmm/samples/deeplearning/conv1dopti_layer/Conv1dOpti-extension/
python setup.py install
Install AtacWorks and set path
cd /home/AtacWorks/
python3.7 -m pip install .
atacworks=/home/AtacWorks/
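A quick import can confirm the install; the package name claragenomics follows from the utils.py path referenced in the multi-socket instructions below.
python3.7 -c "import claragenomics; print('AtacWorks import OK')"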
Download data to train
wget https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/noisy_data/dsc.1.Mono.50.cutsites.smoothed.200.bw
wget https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.bw
wget https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak
Download file conversion binaries and set path
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig /home/
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bigWigToBedGraph /home/
export PATH="$PATH:/home/" # make the bedGraphToBigWig/bigWigToBedGraph binaries visible in the current shell
echo 'export PATH="$PATH:/home/"' >> ~/.bashrc # persist the setting across logins
Data preprocessing
python $atacworks/scripts/peak2bw.py \
--input dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak \
--sizes $atacworks/data/reference/hg19.chrom.sizes \
--out_dir ./ \
--skip 1
python $atacworks/scripts/get_intervals.py \
--sizes $atacworks/data/reference/hg19.auto.sizes \
--intervalsize 50000 \
--out_dir ./ \
--val chr20 \
--holdout chr10
python $atacworks/scripts/bw2h5.py \
--noisybw dsc.1.Mono.50.cutsites.smoothed.200.bw \
--cleanbw dsc.Mono.2400.cutsites.smoothed.200.bw \
--cleanpeakbw dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw \
--intervals training_intervals.bed \
--out_dir ./ \
--prefix Mono.50.2400.train \
--pad 5000 \
--nonzero
python $atacworks/scripts/bw2h5.py \
--noisybw dsc.1.Mono.50.cutsites.smoothed.200.bw \
--cleanbw dsc.Mono.2400.cutsites.smoothed.200.bw \
--cleanpeakbw dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw \
--intervals val_intervals.bed \
--out_dir ./ \
--prefix Mono.50.2400.val \
--pad 5000
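As a sanity check, the generated HDF5 files can be inspected with hdf5-tools (installed earlier); a sketch is below.
h5ls Mono.50.2400.train.h5   # should list the training datasets
h5ls Mono.50.2400.val.h5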
Set affinity and threads
export KMP_AFFINITY=compact,1,0,granularity=fine
# Preload an optimized memory allocator (copy the chosen library to /home first).
# Set only one of the two; a second export would overwrite the first.
export LD_PRELOAD=/home/libtcmalloc.so # or: export LD_PRELOAD=/home/libjemalloc.so
export OMP_NUM_THREADS=31 # (Available cores (N) - 1)
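Rather than hard-coding 31, the thread count can be derived from the machine; a sketch following the (N-1) rule above is below.
# N = physical cores; leave one core free for the main/data-loading thread.
export OMP_NUM_THREADS=$(( $(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l) - 1 ))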
Training run (Single Socket)
# In numactl command, "-C 1-31" is for running on cores 1 to 31.
# General case for an N core machine is "-C 1-(N-1)".
# Keep the batch size in configs/train_config.yaml a multiple of (N-1) for optimum performance
numactl --membind 0 -C 1-31 python $atacworks/scripts/main.py train \
--config configs/train_config.yaml \
--config_mparams configs/model_structure.yaml \
--files_train $atacworks/Mono.50.2400.train.h5 \
--val_files $atacworks/Mono.50.2400.val.h5
Alternative for machines without NUMA: replace the numactl prefix with "taskset -c 1-31 python …"
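Before picking --membind and -C values, the NUMA layout of the instance can be inspected; a quick check is below.
numactl --hardware   # lists NUMA nodes, their CPUs, and memory sizes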
Training run (Multiple Sockets/Nodes)
export OMP_NUM_THREADS=30 # (Available cores (N) - 2)
# 1. Change the dist-backend setting in configs/train_config.yaml (around lines 22-23) to the following
# dist-backend: 'gloo'
# 2. Keep the batch size (bs) in configs/train_config.yaml a multiple of (N-2) for optimum performance.
# The batch size is multiplied by the number of sockets; e.g., with bs=30 and 16 sockets, the effective batch size = 30*16 = 480.
# 3. Comment out the following lines (79-80) in AtacWorks/claragenomics/dl4atac/utils.py and reinstall AtacWorks using "pip install .":
# if (os.path.islink(latest_symlink)):
# os.remove(latest_symlink)
# 4. Run the following Slurm batch script, which uses MPI commands.
sbatch Batchfile_CPU.slurm
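The Batchfile_CPU.slurm script ships with the patch and should be used as-is; for orientation only, a minimal sketch of what such an MPI-launched batch script might look like is below. The node count, task layout, and mpirun invocation are assumptions, not the contents of the actual file.
#!/bin/bash
#SBATCH --job-name=atacworks-train
#SBATCH --nodes=2                 # illustrative; set to the number of instances used
#SBATCH --ntasks-per-node=1       # one training process per node (assumption)
# Launch one rank per node; the gloo backend was configured in train_config.yaml above.
mpirun -np $SLURM_NTASKS python $atacworks/scripts/main.py train \
    --config configs/train_config.yaml \
    --config_mparams configs/model_structure.yaml \
    --files_train $atacworks/Mono.50.2400.train.h5 \
    --val_files $atacworks/Mono.50.2400.val.h5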
Cleanup
Terminate all EC2 instances used to run benchmarks to avoid incurring charges.